Author: Sree Pradyumna Davuloori, MSc Data Science
The recently concluded 2022 FIFA World Cup final was one of the most memorable finals ever witnessed. The tournament overall was very entertaining, with several surprising results and intriguing tactical trends. Some of the tactical trends are explained exceptionally well by Mark Carey of The Athletic in [1].
From the referenced article by Mark Carey, one tactical trend that was particularly interesting was that teams who kept less possession of the ball compared to their opponent in a game scored a higher number of points per game. This shows that the teams who ceded possession of the ball to their opponent and focused on counter-attacking reaped rewards. Morocco and Japan were good examples of this.
The World Cup also displayed a wide gamut of playing styles, including the pass-to-death positional play of Spain that was ultimately unsuccessful, Morocco’s counter-attacking tactics, and the flexibility of the two best teams, Argentina and France.
However, international football is almost a different game when compared to club football. International teams and their managers have substantially less time on the training field to prepare and practice their tactics. Therefore, international teams are far less progressive compared to their club football counterparts and generally tend to focus on pragmatic football that gets the job done.
The focus of this paper will be on English football, specifically the English Premier League. The proposal hypothesized that English football has undergone an evolution in its playing style since influential coaches like Pep Guardiola and Jurgen Klopp joined English clubs around 2016. In the proposal, exploratory analysis was carried out on three major questions:
i. Are goalkeepers increasingly relying on short passes from Goal Kicks?
The analysis of data showed that the median length of goal kicks has reduced every season starting from 2017 and ending in 2022. The mean and median length of passes by the goalkeepers from open play also decreased.
ii. Are Goalkeepers sweeping up long balls behind their defensive line?
The Average distance of defensive actions metric was utilized to analyse whether goalkeepers are stepping out and sweeping long balls. The analysis showed that there was an increase in teams’ usage of sweeping as a tactic.
iii. Are teams making more short passes?
This was analysed using the mean number of short passes made each season, from 2017 until 2022. The result showed fluctuations from season to season and needs further detailed analysis.
Therefore, through a bird's-eye view of the data and some initial analysis, some evidence was gained to suggest that changes are occurring in the English game that can be further explored and explained through a detailed analysis.
The original aim of the paper was to compare the five-year periods before and after the introduction of the two influential coaches Pep Guardiola and Jurgen Klopp, and contrast the state of English football before and after they set foot in England.
Unfortunately, Fbref, which is the data source for this paper, has detailed data starting only from the 2017-2018 season.
Therefore, this paper will aim to produce a comprehensive report on the evolution of English football over the past few years, with a focus on the seasons between 2017-18 and 2021-2022, the last fully completed season.
The report will focus on analysing the evolution of four main facets of football:
i. Goalkeeping
ii. Defence
iii. Passing and Possession of the ball
iv. Pressing
Tactically, these are the four phases of the game that most modern football managers plan most meticulously.
The evolution of these four phases will be analysed through a series of questions or hypotheses. Data visualisations will be utilised to present the analysis and the answer to the questions.
The evolution of the attacking phase of the game, i.e. scoring goals, is not analysed in detail. Scoring goals is obviously important, since goals win games. But the attacking phase is the most laissez-faire phase of the game, in which the players have a high degree of freedom. It requires the most creativity and spontaneous innovation from the players and is therefore the phase that managers tinker with the least. Thierry Henry, who played under Pep Guardiola at Barcelona, has talked about how Guardiola allowed players a lot of freedom once they got the ball into the final third of the pitch, which is when the team is in the attacking phase. Until the ball reached the final third, however, Guardiola put clear rules in place [2].
It is prudent to discuss the caveats to the analyses presented in this report:
Correlation does not equate to causation, as explained with real-world examples in [3]. The core aim of this report is to study the influence of Guardiola and Klopp in pioneering an evolution in English football after their arrival. However, the two managers are not aiming to transform English football; they simply aim to win games and trophies for their respective clubs. So, when this report makes a statement like “Strikers are increasingly scoring more goals from around the penalty spot”, it does not mean that the change is completely down to Guardiola or Klopp.
Football is a complicated sport in which not everything can be explained through statistics.
The statistics used in the report may not perfectly capture what happens on the field. For example, there is a debate about the usage of the xG or Expected goals metric to explain the likelihood that a player should have scored from a particular chance. Likewise, the many qualities of a good defender, like jockeying to delay an attack or pushing the attacker into a less dangerous position, are not captured by metrics that count the number of tackles.
There are some excellent articles online that discuss how football has changed over time and in the past decade.
An article by an unknown author in Soccerblade [4] details five factors that have driven football’s evolution over time: technology, media, social media, tactics and style of play. The article talks about how sports science has entered football and has transformed the fitness levels of modern athletes through recovery methods like physical therapy, ice baths and cryotherapy, and through data-driven analysis of their workloads for injury prevention. The article also gives a brief history of tactical changes in the game. The 2-2-6 formation was used in the early 1900s and is, in a way, making a comeback as the in-possession system used by modern managers to create space in attack. The article also briefly covers the Total Football system pioneered by the famous Dutchmen Johan Cruyff and Rinus Michels, which laid the foundations for Pep Guardiola’s style of play; the Catenaccio implemented by Helenio Herrera, an ultra-defensive style of play; and Pep Guardiola’s tiki-taka and its resounding success at Barcelona. The article mentions the influence of Pep Guardiola and Barcelona on the modern game.
Daniel Taylor’s article in The Athletic [5] details the influence of Pep Guardiola on lower-league teams in English football. The article by Taylor is the closest work to this report. However, this report will take a substantially more data-driven approach to examine the broader evolution of the English game over the past few years, coinciding with Guardiola and Klopp’s arrival in English football, whereas Taylor’s article focuses on examining Guardiola’s direct influence on managers and teams in the lower leagues of English football.
Taylor interviews managers like Ian Evatt of Bolton Wanderers and Ian Burchnall of Notts County (at the time), who express their desire to play an attractive possession-based style focused on winning in style. The article notes other managers operating in the lower leagues, like Ben Garner, Rob Edwards and Liam Manning, who are attempting to eschew the traditional long-ball direct approach played at their teams and employ more of Guardiola’s style. Using data collected by Opta, Taylor shows that the number of long balls has decreased in all of the top 4 leagues of English football. The most fascinating part of the article is the coverage of how even non-league teams are adopting the possession-based style favoured by Guardiola. This includes teams like Dorking Wanderers and Gateshead, operating in the National Leagues, who are playing an expansive possession style of football with success and enthralling fans. Ian Evatt and, surprisingly, Wayne Rooney, formerly of Manchester United, are the only ones in the article who directly credit Guardiola as their influence, but it is clear that Guardiola has inspired many managers in lower leagues and non-league football to adopt his principles.
Bill Connelly’s article in ESPN [6], published in 2020, details the changes in football over the past decade. There are two interesting findings, both gleaned through Connelly’s statistical analysis of data from Opta. One, football has seen an increasing focus on efficiency, with a rise in possession, a rise in pass completion rate, a rise in more patient attacks and a drop in the count of possession changing feet. The number of tackles and fouls decreased, but the number of dribbles increased. This points to an increase in the technical quality of the players, who are therefore better at dribbling. Two, there was an increased focus on pressing in the 2010s, with substantial increases in possessions won in the final third and ball recoveries, both of which measure high pressing.
The English Premier League itself publishes an article every year on its website that details the trends seen that year [7]. The 2021-2022 season trends showed a substantial increase in the number of possessions won in the final third, a clear sign of effective pressing strategies combined with more teams playing out from the back and risking losing possession in their own defensive third. The season also showed an increase in the number of fast, direct attacks, which could be the result of teams adapting to their opponent’s patient possession by staying compact and springing a fast direct attack when they get the ball.
As described in the proposal, this paper will utilize data provided by Fbref [8]. Fbref partners with Opta, which collects detailed statistics about football.
The paper will utilise this data from Fbref:
Picture courtesy: Fbref
For example: The first row shows how many tackles, blocks, interceptions, etc were recorded against Brentford by their opponents, across the entire 2021-2022 season.
Picture courtesy: Fbref
For example, sorted in descending order of tackles in the attacking third, the picture shows Bernardo Silva, Martin Odegaard and Marc Cucurella as the top three players with the most tackles in the attacking third.
This section will give a brief introduction to some of the important concepts of football and the metrics used in this report. Readers with thorough knowledge of football can skip this section of the report.
i. Four main phases in football: In a football game, there are four phases of play. Each team is at one of these phases of play, at any moment in time. They are described in the next four points.
ii. Defensive phase: This phase occurs when a team is without the ball because their opponent has it. Different teams employ different tactics in this phase. Some teams look to aggressively pressure the opponent to win the ball back quickly and launch their own attacks. Others look to station themselves compactly, intending to lure the opponent into leaving some space that they can exploit once the ball is won back; this is called a counter-attack.
iii. Offensive phase: This phase occurs when a team is in possession of the ball. Some managers, like Louis Van Gaal, have explained their further breakdown of this phase into four parts: the construction phase, when the build-up of the attack has just started around the team’s own defensive third; settled possession, which occurs when the team is in control of the ball and has settled into their possession structure; the chance creation phase, which occurs when the team has entered the dangerous areas near the opposition’s goal and is looking to create a scoring chance; and the chance finishing phase, which is when the team’s players look to finish a chance.
iv. Defensive transition phase: This phase occurs as soon as the team has lost the ball. The proposal explained that the fundamental principle in any football system is to spread across the pitch when the team has the ball, so that the opponent can be pulled apart and chances can be created through the exploitation of space. Because the team is spread all across the pitch and therefore disorganised at the moment the ball is lost, the defensive transition phase is dangerous: the opposition can counter-attack to create a chance. This is the phase into which modern managers have put a lot of meticulous planning, especially in the last decade. Teams employ one of two strategies in this phase. They either immediately put pressure on the opponent to win the ball back, which is called counter-pressing, or they immediately retreat into their compact defensive shape so that the opponent cannot counter-attack effectively.
v. Offensive transition phase: This phase occurs as soon as the ball has been won back from the opponent. Teams usually launch a quick counter-attack to utilise the space left by the opponent and create a chance.
vi. Defensive, Middle and Final thirds of the pitch: The football pitch has three imaginary divisions that can separate what a team does in that area. The defensive third is the area of the pitch closest to the goal a team is protecting. The attacking or final third is the area of the pitch closest to the goal a team is attacking to try and score goals. The middle third is between the defensive and attacking thirds. This is explained well through pictures in [9].
Picture courtesy: https://www.rookieroad.com
Picture courtesy: https://www.rookieroad.com
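The three divisions can also be expressed as a small helper. This is only a sketch: the function name `pitch_third` is hypothetical, and the default pitch length of 115 yards (roughly 105 metres) is an assumption, since pitch dimensions vary between grounds.

```python
def pitch_third(distance_from_own_goal: float, pitch_length: float = 115.0) -> str:
    """Classify a position by its distance (in yards) from the goal line a team defends."""
    if not 0 <= distance_from_own_goal <= pitch_length:
        raise ValueError("position is outside the pitch")
    third = pitch_length / 3  # assumed: three equal divisions
    if distance_from_own_goal < third:
        return "defensive"
    if distance_from_own_goal < 2 * third:
        return "middle"
    return "final"


# Illustrative positions: deep in own half, near halfway, close to the opponent's goal.
print(pitch_third(20), pitch_third(58), pitch_third(100))
```

Under these assumptions, each third is a little over 38 yards long.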
vii. Possession: The act of having the ball is called possession. There is a metric called possession that shows how much of the ball a team has over a whole game. This metric is calculated through different methods by different stats companies. But Opta, the stats provider for Fbref, calculates this as a ratio of the number of completed passes made by a team divided by the number of completed passes made in the entire game. Therefore, possession is not calculated as the amount of time spent on the ball, but is rather calculated through the number of passes made.
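The pass-share definition described above can be illustrated with a short sketch. The function name `possession_share` and the pass counts are made up for illustration and are not Fbref or Opta data.

```python
def possession_share(team_passes: int, opponent_passes: int) -> float:
    """Return a team's possession as a percentage of all completed passes in the game."""
    total = team_passes + opponent_passes
    if total == 0:
        raise ValueError("no completed passes recorded in the game")
    return 100 * team_passes / total


# Hypothetical game: the home side completes 600 passes, the away side 400.
home = possession_share(600, 400)
away = possession_share(400, 600)
print(f"Home: {home:.1f}%, Away: {away:.1f}%")
```

Because the metric is a share of passes, the two teams' values always add up to 100, regardless of how long each side actually spent on the ball.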
viii. Pressing: Pressing occurs when a team actively pressures the opponent’s players to win the ball back. Teams employ pressing instead of passively staying in their defensive shape waiting for their opponent to make a mistake. However, pressing in the modern game is meticulously designed by managers. Teams do not aimlessly run around to get the ball back, because this wastes energy and achieves the opposite effect by leaving gaps that the opponent then exploits. Pressing is strategically employed in certain phases of the game and areas of the pitch. High-pressing, which is done near the opponent’s goal, is one of the biggest changes in the modern game. High-pressing is profitable because the ball is won back near the goal being attacked (i.e. the opponent’s goal), so the team is only a few correct passes away from creating a good chance to score. High-pressing should not be confused with counter-pressing. Counter-pressing is the act of pressuring the opponent as soon as the ball is lost so that the team can quickly regain the ball. Counter-pressing is not specific to any area of the pitch. Pressing is explained well by Spielverlagerung in [10].
ix. Sweeping: Sweeping ties into the concept of high-pressing. To press the opponent high up the pitch, teams usually push all their players high up the pitch, including the defenders, to get the best possible chance of overwhelming the opponent near their own goal and get the ball back in an advantageous area. This also happens in a very settled possession sequence, when the ball is under the team’s control and therefore the team’s players have all moved up the pitch in hopes of creating good chances. When either of the two situations happens, there is a huge amount of space left behind the team’s defence. A pass over the top of the defence or through the defence can eliminate most of the team’s players and let the opponent’s strikers or forwards through on goal, one versus one with the goalkeeper. Sweeping is the action of a goalkeeper stepping out of their goal to clear the ball when the defenders have been bypassed. Traditionally, goalkeepers were never comfortable doing this, and even today a lot of goalkeepers are uncomfortable stepping out to clear danger. However, players like Manuel Neuer have pioneered sweeping and the trend has been continued by keepers like Ederson. Sweeping is slowly becoming an essential part of modern progressive football tactics.
x. Average Length of goal kicks by the goalkeeper: This is a metric that details the length of goal kicks by a goalkeeper. In 2019, the laws were altered to allow the team’s outfield players to receive goal kicks inside their own penalty area, and this has led to fascinating use cases of goal kicks. This is explained brilliantly by Michael Cox of The Athletic here [11]. The average length of goal kicks metric is a great way to measure how a team and its manager approach building their possession sequences. A team that focuses on possession almost always makes their first few passes short, usually from the goalkeeper to the defenders. A team focused on direct attacks uses the goal kick to launch the ball forward to usually a tall forward, aiming to get near the opponent’s goal quickly.
xi. Average Length of passes by the Goalkeeper in open play: This metric details the length of passes made by the goalkeeper in open play. Open play does not include goal kicks.
xii. Passes Launched percentage and Goal kicks Launched percentage: In football, a pass being launched means that the pass has been kicked long. Therefore these two metrics measure the percentage of times the goalkeeper has kicked the ball long, from open play and goal kicks respectively.
xiii. Average Distance of Defensive actions: This metric details the distance from the goal where the goalkeeper commits defensive actions, which include claiming the ball or sweeping it, i.e. clearing the ball to avert danger. The higher this distance is, the more proactive the goalkeeper is in coming out of their area to prevent dangerous situations.
xiv. Defensive actions outside the area: This metric counts the number of times a goalkeeper performs a defensive action outside the penalty area. This is a good proxy metric for measuring the amount of aggressive sweeping employed by a team through their goalkeeper.
xv. Tackles in the final third: This metric measures the number of tackles made by a team or a player in the final/attacking third. As discussed previously, this is the area of the pitch closest to a goal being attacked and is therefore a great place to win the ball back with the potential to create a scoring chance.
xvi. xG or Expected goals: Expected goals or the xG metric shows the likelihood that a given shot will end up being a goal. xG ranges from 0, meaning a zero percent chance of being a goal, to 1, meaning a hundred percent chance of being a goal. It comes with a couple of important caveats: xG does not consider the shooting player’s finishing quality or technique, or even the kind of shot they will take, and it does not account for the goalkeeper’s ability to save the shot. What xG does account for is the situation the shot is taken in, including the location and angle from where the shot will be taken, the body part used to take the shot, the type of pass that resulted in the shot and the type of possession that resulted in the shot. xG can be a flawed metric given its caveats. However, if used for what it was designed to indicate, it can be effective in understanding a team’s ability to generate high-quality chances over a substantial period, from around 10 games to a full season, and it can quantify an individual player’s ability to convert chances into goals.
xvii. Non-penalty xG: Penalties have a high probability of being converted into a goal, being a one-versus-one duel with the goalkeeper taken just 12 yards away from the goal. Penalties have been assigned a constant xG value of about 0.76 (this varies slightly between data providers). Therefore, excluding penalties from the xG calculation gives a better indication of a team’s ability to create good scoring chances and a player’s ability to convert said chances.
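The arithmetic behind excluding penalties can be sketched as follows. Fbref publishes non-penalty xG directly, so this is purely illustrative: the function name `non_penalty_xg` and the season totals are hypothetical, and the 0.76 constant is the approximate value quoted above.

```python
PENALTY_XG = 0.76  # approximate constant quoted in the text; providers differ slightly


def non_penalty_xg(total_xg: float, penalties_taken: int,
                   penalty_xg: float = PENALTY_XG) -> float:
    """Subtract the fixed penalty xG for every penalty taken from the total xG."""
    return total_xg - penalties_taken * penalty_xg


# Hypothetical season: 55.0 total xG, with 5 of the shots being penalties.
npxg = non_penalty_xg(55.0, 5)
print(round(npxg, 2))
```

In this made-up case, the five penalties account for 3.8 of the 55.0 expected goals.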
xviii. Shot Creating actions: A shot-creating action is anything done on a football field that immediately leads to a shot on goal. This can include a pass, a dribble, a set piece like a free kick or corner, a foul which leads to a shot from the resulting set piece and even a tackle to win the ball back which leads to a shot. Technically, shot-creating actions include the two previous actions directly before the shot. Therefore, two separate players in a team can get credit for the shot creating action for the same shot.
xix. Tackles: A tackle is a direct duel by the defensive team’s player with the opponent’s player on the ball. A tackle can be attempted without being successful, which means the defensive player tried to win the ball back but failed. Fbref data contains metrics measuring both tackles attempted and tackles completed.
xx. Progressive Passes: A progressive pass is any completed pass that travels more than 40 yards up the pitch. Passes made from inside the team’s defensive third are excluded. The intention of using this metric is to exclude clearances made from near the team’s own goal and to only record passes that intentionally move the team up the pitch and towards the attacking third.
xxi. Passes entering the final third: A final third entering pass is any completed pass that reaches the final third of the pitch.
xxii. Through Balls: A through ball is any pass that sends a teammate into open space towards the goal being attacked. While not always the case, through balls have a high chance of leading to a dangerous scoring chance. The downside is that they are extremely difficult to make, needing eye-of-the-needle passes.
In this section, steps will be taken to prepare the data for analysis. The process will include:
Importing the necessary python libraries:
#Importing libraries:
import pandas as pd
import numpy as np
import requests
import bs4 as bs
from selenium import webdriver
from urllib.error import HTTPError
import time
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from scipy import stats
import unittest
Brief reasoning for the usage of the libraries:
Next, the CSV files are imported from the local machine. These CSV files contain the data that has already been scraped.
# Importing the data from local machine:
standings_data = pd.read_csv('standings_data.csv')
standard_stats_data = pd.read_csv('standard_stats_data.csv')
goalkeeping_merged_data = pd.read_csv('goalkeeping_merged_data.csv')
shooting_merged_data = pd.read_csv('shooting_merged_data.csv')
passing_merged_data = pd.read_csv('passing_merged_data.csv')
defensive_actions_data = pd.read_csv('defensive_actions_data.csv')
possession_data = pd.read_csv('possession_data.csv')
possession_data.head(3)
|  | Unnamed: 0 | squad | players_used | possession | 90s_played | touches | def_penalty_area_touches | def_3rd_touches | middle_3rd_touches | attack_3rd_touches | ... | dribbles_successful | dribbles_attempted | dribbles_success_percent | dribbles_miscontrolled | dribbles_dispossessions | passes_received_successfully | progressive_passes_received | season_name | name | ranking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Arsenal | 23 | 57.8 | 14.0 | 9310 | 705 | 2351 | 4321 | 2716 | ... | 127 | 270 | 47.0 | 220 | 161 | 6437 | 468 | 2022-2023 | possession_data | 1 |
| 1 | 12 | Manchester City | 21 | 66.1 | 14.0 | 11302 | 735 | 2638 | 5466 | 3269 | ... | 111 | 240 | 46.3 | 148 | 112 | 8770 | 556 | 2022-2023 | possession_data | 2 |
| 2 | 14 | Newcastle Utd | 24 | 50.7 | 15.0 | 8542 | 857 | 2661 | 3471 | 2493 | ... | 80 | 233 | 34.3 | 206 | 138 | 5263 | 441 | 2022-2023 | possession_data | 3 |
3 rows × 22 columns
Dropping the unnecessary column named 'Unnamed: 0':
#Execute to drop unwanted columns:
standings_data.drop('Unnamed: 0',axis = 1 ,inplace = True)
standard_stats_data.drop('Unnamed: 0',axis = 1 ,inplace = True)
goalkeeping_merged_data.drop('Unnamed: 0',axis = 1 ,inplace = True)
shooting_merged_data.drop('Unnamed: 0',axis = 1 ,inplace = True)
passing_merged_data.drop('Unnamed: 0',axis = 1 ,inplace = True)
defensive_actions_data.drop('Unnamed: 0',axis = 1 ,inplace = True)
possession_data.drop('Unnamed: 0',axis = 1 ,inplace = True)
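The same clean-up can be written as a loop over the data frames. The sketch below is an alternative, not the approach used in this report: it uses small in-memory CSV text with illustrative values so that it runs without the scraped files, and the `frames` dictionary is a hypothetical container.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the scraped CSV files (values are illustrative).
csv_text = "Unnamed: 0,squad,possession\n0,Arsenal,57.8\n12,Manchester City,66.1\n"

frames = {
    "possession_data": pd.read_csv(io.StringIO(csv_text)),
    "standings_data": pd.read_csv(io.StringIO(csv_text)),
}

# errors="ignore" makes the step safe to re-run after the column is already gone.
frames = {
    name: df.drop(columns="Unnamed: 0", errors="ignore")
    for name, df in frames.items()
}

print(list(frames["possession_data"].columns))
```

One design difference from the cell above: re-running an `inplace` drop on a data frame that no longer has the column raises a `KeyError`, whereas this version does not.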
Two columns will be added to enhance the analysis:
For both these tasks, simple functions were created and then applied on all the data frames.
Create function to add the new season column. Then apply it on all the data frames:
# Create function to add a season column, by splitting the season_name column
def split_season(data_frame):
    # Keep the starting year of the season, e.g. '2021-2022' -> 2021
    data_frame['season'] = data_frame['season_name'].str.split('-', expand = True)[0].astype('int')
    return data_frame
# Apply function on all data frames
standings_data = split_season(standings_data)
standard_stats_data = split_season(standard_stats_data)
goalkeeping_merged_data = split_season(goalkeeping_merged_data)
shooting_merged_data = split_season(shooting_merged_data)
passing_merged_data = split_season(passing_merged_data )
defensive_actions_data = split_season(defensive_actions_data)
possession_data = split_season(possession_data)
possession_data['season'].head(3)
0    2022
1    2022
2    2022
Name: season, dtype: int64
Create function to add the table position column. Then apply it on all the data frames:
#Create a column to indicate Top 6, mid table and Bottom 5 teams in a table position category column:
def create_table_position(df):
    # df is a single row here, because the function is applied with axis = 'columns'
    if df['ranking'] <= 6:
        df['table_position_category'] = 'Top 6'
    elif df['ranking'] > 6 and df['ranking'] <= 15:
        df['table_position_category'] = 'Mid Table'
    else:
        df['table_position_category'] = 'Bottom 5'
    return df
# Apply function on all data frames to create the table position column.
standard_stats_data = standard_stats_data.apply(create_table_position, axis = 'columns')
goalkeeping_merged_data = goalkeeping_merged_data.apply(create_table_position, axis = 'columns')
shooting_merged_data = shooting_merged_data.apply(create_table_position, axis = 'columns')
passing_merged_data = passing_merged_data.apply(create_table_position, axis = 'columns')
defensive_actions_data = defensive_actions_data.apply(create_table_position, axis = 'columns')
possession_data = possession_data.apply(create_table_position, axis = 'columns')
possession_data['table_position_category'].head(3)
0    Top 6
1    Top 6
2    Top 6
Name: table_position_category, dtype: object
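The row-wise function above works, but the same categories can also be assigned in one vectorised step. The sketch below uses `pd.cut` on a small made-up data frame. It assumes that rankings run from 1 to 20; unlike the function above, a ranking above 20 would not be labelled 'Bottom 5'.

```python
import pandas as pd

# Illustrative rankings only; the real data frames hold one row per squad and season.
demo = pd.DataFrame({"squad": ["A", "B", "C", "D"], "ranking": [1, 6, 7, 16]})

# Bins are right-inclusive: (0, 6] -> Top 6, (6, 15] -> Mid Table, (15, 20] -> Bottom 5.
demo["table_position_category"] = pd.cut(
    demo["ranking"],
    bins=[0, 6, 15, 20],
    labels=["Top 6", "Mid Table", "Bottom 5"],
).astype(str)

print(demo["table_position_category"].tolist())
```

On the small data frames used in this report the speed difference is negligible, so this is mainly a matter of style.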
Data visualisations will be the main source of analysis for the report. Therefore, fixed colour palettes are created for each categorical variable's values. This helps in maintaining a constant colour for say, the Top 6 teams in each graph.
# Creating fixed colour palettes:
palette_table_position = "husl"
palette_possession = "hls"
palette_pressing = "Set2"
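A palette name on its own assigns colours in the order in which the categories are encountered, so the colour of a category may change if a graph uses a subset of the data with a different order. One way to guard against this, sketched below, is a dictionary that maps each category to a colour. The name `palette_table_position_fixed` and the hex colours are hypothetical choices and are not used elsewhere in this report.

```python
import pandas as pd

# Hypothetical hex colours; any colours work as long as each category keeps its own.
palette_table_position_fixed = {
    "Top 6": "#1b9e77",
    "Mid Table": "#7570b3",
    "Bottom 5": "#d95f02",
}

demo = pd.DataFrame(
    {"table_position_category": ["Top 6", "Bottom 5", "Mid Table", "Top 6"]}
)

# Check that every category in the data has a colour before plotting.
missing = set(demo["table_position_category"]) - set(palette_table_position_fixed)
print(missing)
```

Seaborn plotting functions accept such a dictionary through the `palette` argument when `hue` is set, so it could replace the palette name in the plots that follow.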
With the increased focus on keeping safe possession, most managers task their goalkeepers to play short passes to their defenders from goal kicks.
Checking the change in Average Length of goal kicks, through a line graph:
plt.figure(figsize = (10,8))
gk_plot_1 = sns.lineplot(data = goalkeeping_merged_data.query("season_name != '2022-2023'"), x = 'season', y = 'goal_kicks_avg_length', \
ci = None, estimator = 'mean', linewidth = 5)
plt.title('Average Length of Goal Kicks per season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.ylabel("Average Length of Goal Kicks", fontsize = 15, labelpad = 15)
gk_plot_1.set_xticks([2017,2018,2019,2020,2021])
gk_plot_1.set(ylim = (41,58))
plt.show()
The average length of goal kicks decreased substantially from 2017 to 2021, falling gradually every year. To add some context: in 2017, goal kicks travelled about 58 yards on average, which is close to half the length of a football pitch, meaning that goal kicks ended up around the midfield area (on average). By 2021, this length had dropped to about 41 yards, meaning that goal kicks now land, on average, in or just beyond the team's own defensive third.
This shows that the trend of goalkeepers just launching every ball forward to a tall forward is disappearing. Teams now focus on building attacks by keeping the ball on the ground, starting from the goalkeeper.
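The values read off the graph can also be computed directly, since the line plot draws the mean of each season. The sketch below shows the aggregation on a small made-up data frame; the numbers are illustrative and are not the values from `goalkeeping_merged_data`.

```python
import pandas as pd

# Illustrative rows only; the real values come from goalkeeping_merged_data.
demo = pd.DataFrame({
    "season": [2017, 2017, 2018, 2018],
    "goal_kicks_avg_length": [60.0, 56.0, 52.0, 50.0],
})

# Same aggregation the line plot performs with estimator = 'mean'.
season_means = demo.groupby("season")["goal_kicks_avg_length"].mean()

# Change between the first and the last season in the data.
change = season_means.iloc[-1] - season_means.iloc[0]

print(season_means.to_dict())
print(change)
```

Applied to the real data frame, the same two steps would give the exact per-season averages and the overall drop rather than values estimated from the graph.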
Now, checking the average length of goal kicks, separated by Table Position Category:
plt.figure(figsize = (10,8))
gk_plot_2 = sns.lineplot(data = goalkeeping_merged_data.query("season_name != '2022-2023'"), x = 'season', y = 'goal_kicks_avg_length',hue = 'table_position_category', \
ci = None, estimator = 'mean', linewidth = 5, palette = palette_table_position)
plt.title('Average Length of Goal Kicks per season by Table Position', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.ylabel("Average Length of Goal Kicks", fontsize = 15, labelpad = 15)
gk_plot_2.set_xticks([2017,2018,2019,2020,2021])
gk_plot_2.set(ylim = (30,62))
plt.legend(title = 'Table Position:')
plt.show()
Overall, the average pass lengths from goal kicks have decreased for all categories of teams.
The Top 6, as expected, keep their goal kicks short by focusing on playing out from the back through their goalkeepers and defenders. The lengths of goal kicks by the Top 6 have gradually reduced from 2017 to 2021, with a slight increase in 2020.
The Mid Table and Bottom 5 teams have seen a substantial reduction in the length of their goal kicks. From 2017 to 2019, the Mid Table and Bottom 5 teams kicked their goal kicks about the same length.
But in 2020 and 2021, the Mid Table teams separated themselves from the Bottom 5 teams in terms of goal kick length. The exact lengths of goal kicks by the Mid Table and Bottom 5 teams are also influenced by the identity of the teams in the Mid Table and Bottom 5 positions that season. However, it is the overall trend that matters, and that shows a decrease in the length of goal kicks.
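The separation between the Mid Table and Bottom 5 teams can be quantified with a pivot table of season by table position category. The sketch below uses made-up numbers purely to show the shape of the calculation; the column names match those used in the plots above.

```python
import pandas as pd

# Illustrative values only, one row per category and season.
demo = pd.DataFrame({
    "season": [2017, 2017, 2017, 2021, 2021, 2021],
    "table_position_category": ["Top 6", "Mid Table", "Bottom 5"] * 2,
    "goal_kicks_avg_length": [45.0, 60.0, 62.0, 32.0, 42.0, 50.0],
})

# Mean goal kick length per season (rows) and table position category (columns).
by_category = demo.pivot_table(
    index="season",
    columns="table_position_category",
    values="goal_kicks_avg_length",
    aggfunc="mean",
)

# Gap between the two groups in each season; a growing gap means they are diverging.
gap = by_category["Bottom 5"] - by_category["Mid Table"]
print(gap.to_dict())
```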
When a possession runs into a cul-de-sac, teams often go back to their goalkeeper to safely restart the possession sequence. It is interesting to check what the goalkeeper does when the ball reaches them in such a situation, from open play.
Checking how average pass lengths have changed from Open Play:
plt.figure(figsize = (10,8))
gk_plot_3 = sns.lineplot(data = goalkeeping_merged_data.query("season_name != '2022-2023'"), x = 'season', y = 'pass_avg_length',\
ci = None, estimator = 'mean', linewidth = 3.75)
plt.title('Average Length of Passes from Open Play per season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.ylabel("Average Length of Passes", fontsize = 15, labelpad = 15)
gk_plot_3.set_xticks([2017,2018,2019,2020,2021])
gk_plot_3.set(ylim = (34,42))
plt.show()
The average length of passes from Open Play has reduced by about 7 yards when comparing the 2017 season to 2021.
Now, separate the average length of open play passes by table position category.
plt.figure(figsize = (10,8))
gk_plot_4 = sns.lineplot(data = goalkeeping_merged_data.query("season_name != '2022-2023'"), x = 'season', y = 'pass_avg_length', hue = 'table_position_category',\
ci = None, estimator = 'mean', linewidth = 5, palette = palette_table_position)
plt.title('Average Length of Passes from Open Play by Table Position', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 25)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.ylabel("Average Length of Passes", fontsize = 15, labelpad = 15)
gk_plot_4.set_xticks([2017,2018,2019,2020,2021])
gk_plot_4.set(ylim = (26,46))
plt.legend(title = 'Table Position:')
plt.show()
The length of passes from Open Play follows a similar pattern to that of the goal kicks seen in Question 1.
f, ax = plt.subplots(1,2, figsize = (18,10), sharey = False)
gk_plot_5 = sns.lineplot(data = goalkeeping_merged_data.query("season_name != '2022-2023'"), x = 'season', y = 'passes_launched_percent', hue = 'table_position_category', ci = None, ax = ax[0],linewidth = 5, palette = palette_table_position, estimator = 'mean')
gk_plot_6 = sns.lineplot(data = goalkeeping_merged_data.query("season_name != '2022-2023'"), x = 'season', y = 'goal_kicks_launched_percent', hue = 'table_position_category', ci = None, ax = ax[1],linewidth = 5, palette = palette_table_position, estimator = 'mean')
gk_plot_5.set(xticks = [2017, 2018, 2019, 2020, 2021])
gk_plot_5.set_title('Percentage of Open Play passes Launched:',fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
gk_plot_5.set_xlabel('Season', fontsize = 15, labelpad = 15)
gk_plot_5.set_ylabel('% of Open Play passes launched', fontsize = 15, labelpad = 15)
gk_plot_5.legend(title = 'Table Position:')
gk_plot_6.set(xticks = [2017, 2018, 2019, 2020, 2021])
gk_plot_6.set_title('Percentage of Goal Kicks Launched:',fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
gk_plot_6.set_xlabel('Season', fontsize = 15, labelpad = 15)
gk_plot_6.set_ylabel('% of Goal Kicks launched', fontsize = 15, labelpad = 15)
gk_plot_6.legend(title = 'Table Position:')
plt.show()
Analysing the two graphs:
i. The plot for % of Open Play passes launched per season shows that:
ii. The plot for % of Goal Kicks launched per season:
The Mid Table and Bottom 5 teams were launching, i.e. kicking long, a whopping 90% of their goal kicks in the 2017 season. As the seasons have progressed, this percentage has decreased substantially for both sets of teams: by the 2021 season, the Mid Table teams kick only about 60% of their goal kicks long, whereas the Bottom 5 teams kick about 75% long.
Surprisingly, the Top 6 teams were launching more than 50% of their goal kicks in the 2017 season, but this fell to about 40% by the 2021 season.
First, merge the possession stats for each team across a season, which is present in the possession_data table, into the goalkeeping data frame. Then, make a new data frame using the query method, by excluding the ongoing 2022-2023 season data.
goalkeeping_merged_data = pd.merge(goalkeeping_merged_data, possession_data.loc[:, ['season_name', 'squad', 'possession']], on = ['squad', 'season_name'])
goalkeeping_merged_data_past_5 = goalkeeping_merged_data.query("season_name != '2022-2023'")
# Creating an lmplot:
gk_plot_7 = sns.lmplot(data = goalkeeping_merged_data_past_5 , x = 'possession', y = 'pass_avg_length', fit_reg= True, scatter_kws= {'s': 50},
height = 6, aspect = 1.5)
plt.title('Relationship between Possession and Average Pass Length', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Possession", fontsize = 15, labelpad = 15)
plt.ylabel("Average Pass Lengths", fontsize = 15, labelpad = 15)
gk_plot_7.set(ylim = (22,56))
gk_plot_7.set(xlim= (33,74))
plt.show()
It looks like there is a strong, negative linear relationship between possession of the ball and average pass length. Possession increases as the average pass length decreases.
Teams that aim to keep possession of the ball are instructing their goalkeeper to look for shorter passes. A shorter pass to a teammate, such as a central defender, has a much higher chance of being completed and therefore of keeping possession. This does come with a risk: if a short pass goes wrong, the team loses the ball near its own goal and gifts the opponent a potentially dangerous scoring chance. However, most managers are happy to accept this trade-off.
Checking the correlation between possession of the ball and average length of passes:
print(f"The Correlation between Possession and Average Pass Length is: {goalkeeping_merged_data_past_5['pass_avg_length'].corr(goalkeeping_merged_data_past_5['possession'])}")
The Correlation between Possession and Average Pass Length is: -0.8213930606391877
Correlation corroborates the strong inverse/negative relationship between Possession and Average Pass Length.
The plot can be improved by adding the information about whether a team, over the course of a season is a:
First, a new column will be created to categorise a team as High, Mid or Low possession team. To do this, the following steps are taken:
#Creating the possession category column:
possession_75_percentile = pd.DataFrame(goalkeeping_merged_data.groupby('season')['possession'].quantile(0.75))
possession_75_percentile['possession_50th_percentile'] = goalkeeping_merged_data.groupby('season')['possession'].quantile(0.5)
possession_75_percentile = possession_75_percentile.reset_index(level = 0 )
possession_75_percentile.rename({'possession': 'possession_75th_percentile'}, axis= 1, inplace= True)
goalkeeping_merged_data = pd.merge(goalkeeping_merged_data, possession_75_percentile.loc[:, ['season', 'possession_75th_percentile', 'possession_50th_percentile']], on = ['season'])
def high_possession_category(row):
    # Label one team-season row relative to that season's possession percentiles
    if row['possession'] > row['possession_75th_percentile']:
        row['high_possession_team'] = 'High Possession Team'
    elif row['possession_50th_percentile'] <= row['possession'] <= row['possession_75th_percentile']:
        row['high_possession_team'] = 'Mid Possession Team'
    else:
        row['high_possession_team'] = 'Low Possession Team'
    return row
goalkeeping_merged_data = goalkeeping_merged_data.apply(high_possession_category , axis = 'columns')
goalkeeping_merged_data.head(3)
| | squad | goals_conceded | goals_conceded_per90 | shots_on_target_faced | saves | saves_percent | clean_sheets | clean_sheets_percent | season_name | name_x | ... | defensive_actions_outside_area_per90 | avg_dist_of_defensive_actions | name_y | ranking | season | table_position_category | possession | possession_75th_percentile | possession_50th_percentile | high_possession_team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arsenal | 11 | 0.79 | 37 | 27 | 73.0 | 7 | 50.0 | 2022-2023 | goalkeeping_stats_data | ... | 1.07 | 15.2 | advanced_goalkeeping_data | 1 | 2022 | Top 6 | 57.8 | 52.85 | 50.0 | High Possession Team |
| 1 | Manchester City | 14 | 1.00 | 38 | 25 | 68.4 | 6 | 42.9 | 2022-2023 | goalkeeping_stats_data | ... | 1.50 | 17.2 | advanced_goalkeeping_data | 2 | 2022 | Top 6 | 66.1 | 52.85 | 50.0 | High Possession Team |
| 2 | Newcastle Utd | 11 | 0.73 | 54 | 43 | 81.5 | 7 | 46.7 | 2022-2023 | goalkeeping_stats_data | ... | 2.40 | 19.3 | advanced_goalkeeping_data | 3 | 2022 | Top 6 | 50.7 | 52.85 | 50.0 | Mid Possession Team |
3 rows × 42 columns
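The same categorisation can be written without a row-wise `apply`. The sketch below is one possible vectorised version, assuming the same rules as above (above the season's 75th percentile is High, between the 50th and 75th is Mid, otherwise Low). The helper name `categorise_by_season_quantiles` and the `toy` frame are hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def categorise_by_season_quantiles(df, value_col, labels):
    """Label each row high/mid/low relative to its own season's quantiles.

    labels is a (high, mid, low) tuple. Hypothetical helper for illustration.
    """
    grouped = df.groupby("season")[value_col]
    # Broadcast each season's quantile back onto every row of that season
    q75 = grouped.transform(lambda s: s.quantile(0.75))
    q50 = grouped.transform(lambda s: s.quantile(0.50))
    high, mid, low = labels
    # np.select takes the first matching condition, so "mid" only applies
    # to rows that were not already labelled "high"
    conditions = [df[value_col] > q75, df[value_col] >= q50]
    return pd.Series(np.select(conditions, [high, mid], default=low), index=df.index)

# Illustrative data only: five teams in a single season
toy = pd.DataFrame({
    "season": [2021] * 5,
    "possession": [40.0, 45.0, 50.0, 55.0, 65.0],
})
toy["high_possession_team"] = categorise_by_season_quantiles(
    toy,
    "possession",
    ("High Possession Team", "Mid Possession Team", "Low Possession Team"),
)
print(toy)
```

Because the column name and labels are parameters, the same helper could in principle be reused for any other per-season percentile split.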
Recreate the data frame by excluding the ongoing 2022-2023 season's data:
goalkeeping_merged_data_past_5 = goalkeeping_merged_data.query("season_name != '2022-2023'")
Now check the relationship between Pass Length vs Possession, accounting for High, Mid & low possession teams:
gk_plot_8 = sns.lmplot(data = goalkeeping_merged_data_past_5 , x = 'possession', y = 'pass_avg_length', hue = 'high_possession_team',
fit_reg= False, height = 6, aspect = 1.5, palette = palette_possession, scatter_kws= {'s': 60})
plt.title('Possession vs Avg Pass Length by Possession category', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Possession",fontsize = 15, labelpad = 15)
plt.ylabel("Average Pass Lengths",fontsize = 15, labelpad = 15)
plt.legend(title = 'Possession Category:')
gk_plot_8.legend.remove()
plt.show()
The graph shows that:
The above plot throws up questions about whether any patterns can be spotted in the teams based on their ranking or position in the table.
Making a plot for Possession vs Average Pass Lengths, separated by Table Position:
gk_plot_9 = sns.lmplot(data = goalkeeping_merged_data , x = 'possession', y = 'pass_avg_length', hue = 'table_position_category',
fit_reg= False, height = 6, aspect = 1.5, palette = palette_table_position, scatter_kws= {'s': 60})
plt.title('Possession vs Avg Pass Length by Table Position', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Possession",fontsize = 15, labelpad = 15)
plt.ylabel("Average Pass Lengths",fontsize = 15, labelpad = 15)
#gk_plot_9.set(ylim = (22,56))
#gk_plot_9.set(xlim= (33,74))
plt.legend(title = 'Table Position:')
gk_plot_9.legend.remove()
plt.show()
Analysing the above graph:
First, checking the average of number of defensive actions outside the area by teams of different possession categories:
plt.figure(figsize = (10,8))
gk_plot_10 = sns.barplot(data = goalkeeping_merged_data_past_5, x = 'defensive_actions_outside_area', y = 'high_possession_team' ,orient = 'h' ,estimator = np.mean,
color = 'purple')
plt.title('Defensive actions outside the area by Possession category', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Average Defensive Actions outside penalty area",fontsize = 15, labelpad = 15)
plt.ylabel("Possession Category",fontsize = 15, labelpad = 15)
plt.yticks(fontweight = 'bold')
plt.show()
It is interesting to note that the average number of defensive actions outside the area is higher, albeit just marginally, for the Low possession teams compared to the high possession teams. However, there are a couple of caveats to this:
Let us look at the season wide trends for Average Distance of Defensive actions:
plt.figure(figsize = (10,8))
gk_plot_11 = sns.lineplot(data = goalkeeping_merged_data_past_5, x = 'season', y = 'avg_dist_of_defensive_actions', ci = None, estimator = np.mean, linewidth = 5)
gk_plot_11.set(xticks = [2017, 2018, 2019, 2020, 2021])
plt.title('Average Distance of defensive actions per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season",fontsize = 15, labelpad = 15)
plt.ylabel("Average Distance of Defensive actions",fontsize = 15, labelpad = 15)
plt.show()
The average distance of defensive actions by the goalkeeper has had an overall increase across the five seasons, albeit with reductions in 2019 and 2020. This shows that goalkeepers are stepping out to avert danger. However, it is tricky to give context to the numbers in the above graph.
As the above graph was tricky to contextualise, it is prudent to check if separating the graph by the possession category of teams gives clearer insights:
plt.figure(figsize = (15,8))
gk_plot_12 = sns.lineplot(data = goalkeeping_merged_data_past_5, x = 'season', y = 'avg_dist_of_defensive_actions', ci = None,linewidth = 5 ,estimator = np.mean, hue = 'high_possession_team', palette = palette_possession)
gk_plot_12.set(xticks = [2017, 2018, 2019, 2020, 2021])
plt.title('Average distance of defensive actions per season by Possession category', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season",fontsize = 15, labelpad = 15)
plt.ylabel("Average Distance of Defensive actions",fontsize = 15, labelpad = 15)
plt.legend(title = 'Possession Category:')
plt.show()
The plot fluctuates considerably, so it cannot be definitively asserted that all teams are sweeping more. It can only be concluded that teams who keep a high amount of possession of the ball do instruct their keeper to sweep up long balls, compared to teams who keep lower amounts of possession. This makes sense tactically as well, because these possession-heavy systems commit a lot of their players forward, leaving themselves vulnerable to a ball behind their defence.
This is a pertinent question. As discussed previously, teams who employ high pressing push their defenders very high, sometimes even into the opposition's half, to overwhelm the opposition and win the ball back. This leaves gaps behind the team's defence, so sweeping can be a necessary fail-safe to combat potential danger when the high pressing strategy fails.
Create a column to calculate the % of tackles made in final 3rd:
This is calculated as the number of tackles in the final third divided by total number of tackles.
#Calculate percentage of final 3rd tackles:
defensive_actions_data['pct_final_3rd_tackles'] = (defensive_actions_data['tackles_in_attacking_3rd'])/(defensive_actions_data['tackles_in_attacking_3rd'] + defensive_actions_data['tackles_in_middle_3rd'] + defensive_actions_data['tackles_in_defensive_3rd']) * 100
defensive_actions_data.query("season_name != '2022-2023'")['pct_final_3rd_tackles'].describe()
count    100.000000
mean      11.599535
std        2.975216
min        5.676127
25%        9.430769
50%       11.482735
75%       12.760867
max       20.363636
Name: pct_final_3rd_tackles, dtype: float64
The above table shows that the maximum percentage of tackles a team has committed in the final third is about 20%, so this is the team that has been the most aggressive high pressing team over the past five seasons, excluding the ongoing 2022-2023 season. 50 percent of teams make between about 9.4% and 12.8% of their tackles in the final third, as evidenced by the 25th and 75th percentiles.
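As a worked instance of the calculation, the tackle counts for Arsenal's 2022-2023 row (taken from the data frame preview shown later in this section) can be plugged into the formula by hand. The variable names here are chosen for illustration only:

```python
# Tackle counts by third for Arsenal, 2022-2023 (from the preview table in this section)
tackles_defensive_3rd = 94
tackles_middle_3rd = 69
tackles_attacking_3rd = 43

# Share of all tackles that were made in the final (attacking) third
total_tackles = tackles_defensive_3rd + tackles_middle_3rd + tackles_attacking_3rd
pct_final_3rd_tackles = tackles_attacking_3rd / total_tackles * 100

print(total_tackles)                    # 206
print(round(pct_final_3rd_tackles, 2))  # 20.87
```

The result agrees with the `pct_final_3rd_tackles` value of 20.873786 listed for that row.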
Now, the newly created column and its 75th and 50th percentiles can be utilised to categorise the teams as High pressing and non high pressing teams. The same logic used to create the possession category is followed.
pressing_75_percentile = pd.DataFrame(defensive_actions_data.groupby('season')['pct_final_3rd_tackles'].quantile(0.75))
pressing_75_percentile['pressing_final_3rd_pct_50th_percentile'] = defensive_actions_data.groupby('season')['pct_final_3rd_tackles'].quantile(0.5)
pressing_75_percentile = pressing_75_percentile.reset_index(level = 0 )
pressing_75_percentile.rename({'pct_final_3rd_tackles': 'pressing_final_3rd_pct_75th_percentile'}, axis= 1, inplace= True)
defensive_actions_data = pd.merge(defensive_actions_data, pressing_75_percentile.loc[:, ['season', 'pressing_final_3rd_pct_75th_percentile', 'pressing_final_3rd_pct_50th_percentile']], on = ['season'])
def high_pressing_category(row):
    # Label one team-season row relative to that season's final-third tackle percentiles
    if row['pct_final_3rd_tackles'] > row['pressing_final_3rd_pct_75th_percentile']:
        row['high_press_team'] = 'High Pressing Team'
    elif row['pressing_final_3rd_pct_50th_percentile'] <= row['pct_final_3rd_tackles'] <= row['pressing_final_3rd_pct_75th_percentile']:
        row['high_press_team'] = 'Average High Press Team'
    else:
        row['high_press_team'] = 'Non High Pressing Team'
    return row
defensive_actions_data = defensive_actions_data.apply(high_pressing_category , axis = 'columns')
defensive_actions_data.head(3)
| | squad | players_used | 90s_played | tackles_made | tackles_won | tackles_in_defensive_3rd | tackles_in_middle_3rd | tackles_in_attacking_3rd | tackles_won_vs_dribbles | total_tackles_vs_dribblers_includes_lost_plus_won | ... | errors_leading_to_goal | season_name | name | ranking | season | table_position_category | pct_final_3rd_tackles | pressing_final_3rd_pct_75th_percentile | pressing_final_3rd_pct_50th_percentile | high_press_team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arsenal | 23 | 14.0 | 206 | 128 | 94 | 69 | 43 | 88 | 162 | ... | 11 | 2022-2023 | defensive_actions_data | 1 | 2022 | Top 6 | 20.873786 | 16.672176 | 13.240478 | High Pressing Team |
| 1 | Manchester City | 21 | 14.0 | 175 | 102 | 67 | 77 | 31 | 90 | 151 | ... | 3 | 2022-2023 | defensive_actions_data | 2 | 2022 | Top 6 | 17.714286 | 16.672176 | 13.240478 | High Pressing Team |
| 2 | Newcastle Utd | 24 | 15.0 | 250 | 147 | 106 | 100 | 44 | 124 | 238 | ... | 3 | 2022-2023 | defensive_actions_data | 3 | 2022 | Top 6 | 17.600000 | 16.672176 | 13.240478 | High Pressing Team |
3 rows × 28 columns
Now, merging the newly created column to indicate a high press team, with the goalkeepers data frame, to answer the question:
goalkeeping_merged_data = pd.merge(goalkeeping_merged_data, defensive_actions_data.loc[:, ['season_name', 'squad', 'high_press_team', 'pct_final_3rd_tackles']], on = ['season_name', 'squad'])
goalkeeping_merged_data_past_5 = goalkeeping_merged_data.query(" season_name != '2022-2023' ")
Finally, answering the question of whether high press teams are employing more sweeping:
gk_plot_13 = sns.lmplot(data = goalkeeping_merged_data, x = 'avg_dist_of_defensive_actions', y = 'defensive_actions_outside_area', fit_reg = False
, hue = 'high_press_team', palette = palette_pressing, height = 6, aspect = 1.5, scatter_kws= {'s': 100})
plt.title('Analysing Sweeping for High Press Teams', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Avg. Distance of Defensive Actions",fontsize = 15, labelpad = 15)
plt.ylabel("Defensive actions outside area",fontsize = 15, labelpad = 15)
plt.legend(title = 'Pressing Category:', loc = 'best')
gk_plot_13.legend.remove()
plt.show()
A lot is happening in the graph, let's break it down:
It cannot be concluded that teams who press high are employing sweeping as a tactic, if only because of the dearth of top-quality sweeper keepers.
Making a new data frame for the defensive performance data of teams, excluding the ongoing 2022-2023 season:
def_actions_past_5 = defensive_actions_data.query(" season_name != '2022-2023'")
This could be a sign that more teams are looking to win the ball back in the final third through high pressing.
plt.figure(figsize = (10,8))
def_plot_1 = sns.lineplot(data = def_actions_past_5 , x = 'season', y = 'pct_final_3rd_tackles', ci = None, linewidth = 5, estimator = np.mean)
def_plot_1.set_xticks([2017,2018,2019,2020,2021])
def_plot_1.set_yticks([10.75,11,11.25,11.50,11.75, 12, 12.25, 12.50, 12.75])
plt.title('Percentage of Tackles in Final 3rd', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.ylabel("% of Tackles in Final 3rd", fontsize = 15, labelpad = 15)
plt.show()
The percentage of tackles made in the final 3rd has gradually increased every season from 2017 to 2021.
An increase of about 2 percentage points may not seem substantial. However, the previous section provided some context for this, showing that, cumulatively for the 2017 to 2021 seasons, 50 percent of teams make between about 9.4% and 12.8% of their tackles in the final third. Set against an interquartile range of roughly 3.3 percentage points, a 2-point increase is substantial.
The trend suggested by the graph can be analysed from another point of view by comparing the 25th and 75th percentiles for the 2017 season and the 2021 season:
2017 season:
defensive_actions_data.query("season_name == '2017-2018'")['pct_final_3rd_tackles'].describe()
count    20.000000
mean     10.630773
std       2.673251
min       5.676127
25%       8.258480
50%      10.121267
75%      12.812276
max      16.163410
Name: pct_final_3rd_tackles, dtype: float64
2021 season:
defensive_actions_data.query("season_name == '2021-2022'")['pct_final_3rd_tackles'].describe()
count    20.000000
mean     12.672657
std       3.188174
min       8.023774
25%      11.550675
50%      12.044403
75%      13.740277
max      20.281124
Name: pct_final_3rd_tackles, dtype: float64
It is noticeable that the percent of final third tackles has increased in every metric.
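The two summaries can also be placed side by side so that the differences are explicit. This is a sketch under the assumption that the frame has `season_name` and `pct_final_3rd_tackles` columns as above; the `toy` values are placeholders, not the scraped data:

```python
import pandas as pd

# Illustrative stand-in for defensive_actions_data (made-up values)
toy = pd.DataFrame({
    "season_name": ["2017-2018"] * 4 + ["2021-2022"] * 4,
    "pct_final_3rd_tackles": [8.0, 10.0, 12.0, 14.0, 10.0, 12.0, 14.0, 16.0],
})

# One describe() per season, transposed so that each season is a column
summary = (
    toy[toy["season_name"].isin(["2017-2018", "2021-2022"])]
    .groupby("season_name")["pct_final_3rd_tackles"]
    .describe()
    .T
)
summary["difference"] = summary["2021-2022"] - summary["2017-2018"]
print(summary)
```

A positive `difference` for the mean and each percentile row would correspond to the across-the-board increase described above.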
Now, separate the graph by Table position to see clearer trends:
plt.figure(figsize = (10,8))
def_plot_2 = sns.lineplot(data = def_actions_past_5 , x = 'season', y = 'pct_final_3rd_tackles',hue = 'table_position_category' ,ci = None, estimator = np.mean ,linewidth = 5, palette = palette_table_position)
def_plot_2.set_xticks([2017,2018,2019,2020,2021])
plt.title('Percentage of Tackles in Final 3rd by Table Position', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.ylabel("% of Tackles in Final 3rd", fontsize = 15, labelpad = 15)
plt.legend(title = 'Table Position:')
plt.show()
The above graph shows that:
For more context, we can check the % of tackles made in final 3rd by separating the High, Mid and Low pressing teams.
plt.figure(figsize = (10,8))
def_plot_3 = sns.lineplot(data = def_actions_past_5 , x = 'season', y = 'pct_final_3rd_tackles',hue = 'high_press_team' ,ci = None, linewidth = 5, palette = palette_pressing, estimator = np.mean)
def_plot_3.set_xticks([2017,2018,2019,2020,2021])
def_plot_3.set_ylim(8, 18)
plt.title('Percentage of Tackles in Final 3rd by Pressing Category', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.ylabel("% of Tackles in Final 3rd", fontsize = 15, labelpad = 15)
plt.legend(title = 'Pressing Category:')
plt.show()
We now have an idea of how large a chunk of their tackles each type of team makes in the final 3rd.
To assess the defensive performance of teams, the data about how opponents are performing against each team will be scraped from Fbref and stored in pandas data frames.
Specifically, the xG against metric will be used to measure how much xG teams are giving up to their opponents. This will be used to assess whether high pressing teams benefit defensively by pressing high and stopping attacks before they begin.
#Get standard stats for opponent performance against each team:
def get_vs_squad_stats(urls):
    # urls: dictionary mapping each season name to its Fbref season page URL
    standard_stats_against_cols = ['squad', 'players_used', 'age', 'possession_against', 'matches_played', 'starts_by_player',
                                   'total_minutes', '90s_played', 'goals_scored_against', 'assists_against', 'non_penalty_goals_against',
                                   'penalties_against','penalties_attempted_against' ,'yellow_cards_against', 'red_cards_against', 'goals_per90_against', 'assists_per90_against',
                                   'goals_plus_assists_per90_against', 'non_penalty_goals_per90_against', 'non_penalty_goals_plus_assists_per90_against',
                                   'xg_against', 'non_penalty_xg_against', 'xa_against', 'non_penalty_xg_plus_xa_against', 'xg_per90_against', 'xa_per90_against',
                                   'xg_plus_xa_per90_against', 'non_penalty_xg_per90_against', 'non_penalty_xg_plus_xa_per90_against']
    standard_stats_against_data = pd.DataFrame()
    try:
        for season_name, season_url in urls.items():
            # Reachability check: raises ConnectionError if the URL cannot be reached
            page = requests.get(season_url)
            # read_html downloads the page itself; index 1 is the "vs squad" version of the table
            standard_stats_against = pd.read_html(season_url, match = "Squad Standard Stats")[1]
            standard_stats_against = pd.DataFrame(standard_stats_against)
            # Drop the top level of the two-level header, then apply the clean column names
            standard_stats_against.columns = standard_stats_against.columns.droplevel()
            standard_stats_against.columns = standard_stats_against_cols
            standard_stats_against['season_name'] = season_name
            standard_stats_against['name'] = 'standard_stats_against'
            standard_stats_against_data = pd.concat([standard_stats_against_data, standard_stats_against], ignore_index = True)
            # Pause between requests to avoid overloading the site
            time.sleep(20)
    except HTTPError as error:
        print(error)
        print("The code could not be executed due to a HTTP error!")
    except ValueError:
        print("The table could not be found in the URL provided! Check the table Title")
    except requests.exceptions.ConnectionError as e:
        print("Connection could not be established! Check the URL!")
    except IndexError:
        print("There was a list index out of range error!")
    return(standard_stats_against_data)
urls_opponent = {'2022-2023': 'https://fbref.com/en/comps/9/Premier-League-Stats',
'2021-2022': 'https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats',
'2020-2021': 'https://fbref.com/en/comps/9/2020-2021/2020-2021-Premier-League-Stats',
'2019-2020': 'https://fbref.com/en/comps/9/2019-2020/2019-2020-Premier-League-Stats',
'2018-2019': 'https://fbref.com/en/comps/9/2018-2019/2018-2019-Premier-League-Stats',
'2017-2018': 'https://fbref.com/en/comps/9/2017-2018/2017-2018-Premier-League-Stats'}
standard_stats_against_data = get_vs_squad_stats(urls_opponent)
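One design point worth considering for a scraper like this is where the `try` block sits. When it wraps the whole loop, the first failing season ends the loop and the later seasons are never fetched. The sketch below shows a per-season alternative. It is an illustration only: `scrape_seasons`, `fetch_table` and `fake_fetch` are hypothetical names, the example.com URLs are placeholders, and the fetcher is injected so that the sketch runs offline.

```python
import pandas as pd

def scrape_seasons(urls, fetch_table):
    """Collect one table per season, skipping seasons whose fetch fails.

    fetch_table is a hypothetical callable (url -> DataFrame); in a real
    scraper this role would be played by pd.read_html plus column renaming.
    """
    frames, failed = [], {}
    for season_name, season_url in urls.items():
        try:
            table = fetch_table(season_url)
        except (ValueError, IndexError, OSError) as error:
            # OSError also covers urllib's URLError and HTTPError.
            # Record the failure and carry on with the remaining seasons.
            failed[season_name] = repr(error)
            continue
        frames.append(table.assign(season_name=season_name))
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, failed

# Stand-in fetcher: one placeholder URL "fails" on purpose
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("No tables found")
    return pd.DataFrame({"squad": ["vs Arsenal", "vs Chelsea"], "xg_against": [1.0, 2.0]})

data, failures = scrape_seasons(
    {"2020-2021": "https://example.com/ok", "2021-2022": "https://example.com/broken"},
    fake_fetch,
)
print(len(data), failures)
```

Returning the failures alongside the data makes it possible to retry only the seasons that did not load.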
Running unit tests on the above returned data frame: The unittest library is being utilised to run unit tests on the above function and perform some test driven development. Four test cases are being verified:
class TestScraping(unittest.TestCase):
    # Test whether the function exists:
    def test_fun_exists(self):
        self.assertIsNotNone(get_vs_squad_stats)
        print("The function exists!")
    # Test whether the data frame is not None, so something has been returned
    def test_data_frame_exists(self):
        self.assertIsNotNone(standard_stats_against_data)
        print("The data frame exists!")
    # Test whether the data frame has at least the expected minimum number of rows:
    def test_data_frame_rows_length(self):
        self.assertGreaterEqual(len(standard_stats_against_data), 100)
        print("The data frame has some data, the scraping is successful!")
    # Test whether the data frame has at least the expected minimum number of columns:
    def test_data_frame_columns_length(self):
        self.assertGreaterEqual(len(standard_stats_against_data.columns), 10)
        print("The data frame has some columns, the scraping is successful!")
unittest.main(argv = ['ignored', '-v'], exit = False)
test_data_frame_columns_length (__main__.TestScraping) ... ok
test_data_frame_exists (__main__.TestScraping) ... ok
test_data_frame_rows_length (__main__.TestScraping) ... ok
test_fun_exists (__main__.TestScraping) ... ok
The data frame has some columns, the scraping is successful!
The data frame exists!
The data frame has some data, the scraping is successful!
The function exists!
----------------------------------------------------------------------
Ran 4 tests in 0.005s
OK
All the unit tests have passed and the scraping is successful.
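The checks above confirm that something was returned. A slightly stricter set of checks could look at the content as well. The following is a sketch only: `check_scraped_frame` is a hypothetical helper, it assumes `season_name` and `squad` columns as in the scraped frame, and the `toy` frame is made up to trigger one of the checks.

```python
import pandas as pd

def check_scraped_frame(df, expected_seasons, teams_per_season=20):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Every requested season should be present
    missing = set(expected_seasons) - set(df["season_name"])
    if missing:
        problems.append(f"missing seasons: {sorted(missing)}")
    # Each season should contain the expected number of distinct squads
    counts = df.groupby("season_name")["squad"].nunique()
    wrong = counts[counts != teams_per_season]
    if not wrong.empty:
        problems.append(f"unexpected team counts: {wrong.to_dict()}")
    # A squad should appear only once per season
    if df.duplicated(subset=["season_name", "squad"]).any():
        problems.append("duplicate squad rows within a season")
    return problems

# Illustrative frame: the second season is one team short
toy = pd.DataFrame({
    "season_name": ["2020-2021", "2020-2021", "2021-2022"],
    "squad": ["A", "B", "A"],
})
problems = check_scraped_frame(toy, ["2020-2021", "2021-2022"], teams_per_season=2)
print(problems)
```

Each of these conditions could equally be written as a `unittest` assertion inside a test class like the one above.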
Re-creating the two functions for data type checking and outlier checking:
#creating a function to get the number of missing values in each column and data types of each column:
def assess_data_frames(df_name):
    print(f"There are {df_name.shape[0]} rows and {df_name.shape[1]} columns\n")
    assess_missing_values = len(df_name.isna().sum()[df_name.isna().sum() > 0])
    if (assess_missing_values) == 0:
        print("There are no columns with missing values")
    else:
        print(f"These columns have missing values:\n\n{df_name.isna().sum()[df_name.isna().sum() > 0]}\n\nThese are the data types of the columns:\n ")
    print(f"{df_name.dtypes}")
# Function to assess if there are outliers in a column; applied column by column to a data frame.
# Note: despite the name, values are flagged by z-score (|z| > 3), not by the IQR rule.
def find_outliers_IQR(df):
    outliers = []
    if df.dtype == 'int64' or df.dtype == 'float64':
        z = np.abs(stats.zscore(df))
        outliers.append(df.loc[z > 3])
        return(outliers)
    else:
        # Non-numeric columns are skipped (the function returns None for them)
        pass
Run the function to check for missing values and data types:
assess_data_frames(standard_stats_against_data)
There are 120 rows and 31 columns

There are no columns with missing values
squad                                            object
players_used                                      int64
age                                             float64
possession_against                              float64
matches_played                                    int64
starts_by_player                                  int64
total_minutes                                     int64
90s_played                                      float64
goals_scored_against                              int64
assists_against                                   int64
non_penalty_goals_against                         int64
penalties_against                                 int64
penalties_attempted_against                       int64
yellow_cards_against                              int64
red_cards_against                                 int64
goals_per90_against                             float64
assists_per90_against                           float64
goals_plus_assists_per90_against                float64
non_penalty_goals_per90_against                 float64
non_penalty_goals_plus_assists_per90_against    float64
xg_against                                      float64
non_penalty_xg_against                          float64
xa_against                                      float64
non_penalty_xg_plus_xa_against                  float64
xg_per90_against                                float64
xa_per90_against                                float64
xg_plus_xa_per90_against                        float64
non_penalty_xg_per90_against                    float64
non_penalty_xg_plus_xa_per90_against            float64
season_name                                      object
name                                             object
dtype: object
Run the function to check for outliers:
standard_stats_against_data.apply(find_outliers_IQR, axis = 'rows')
squad                                           None
players_used                                    [[]]
age                                             [[]]
possession_against                              [[]]
matches_played                                  [[]]
starts_by_player                                [[]]
total_minutes                                   [[]]
90s_played                                      [[]]
goals_scored_against                            [[]]
assists_against                                 [[]]
non_penalty_goals_against                       [[]]
penalties_against                               [[]]
penalties_attempted_against                     [[]]
yellow_cards_against                            [[]]
red_cards_against                               [[9]]
goals_per90_against                             [[]]
assists_per90_against                           [[]]
goals_plus_assists_per90_against                [[]]
non_penalty_goals_per90_against                 [[]]
non_penalty_goals_plus_assists_per90_against    [[]]
xg_against                                      [[]]
non_penalty_xg_against                          [[]]
xa_against                                      [[]]
non_penalty_xg_plus_xa_against                  [[]]
xg_per90_against                                [[]]
xa_per90_against                                [[]]
xg_plus_xa_per90_against                        [[]]
non_penalty_xg_per90_against                    [[]]
non_penalty_xg_plus_xa_per90_against            [[]]
season_name                                     None
name                                            None
dtype: object
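The function used here flags values whose absolute z-score exceeds 3. For comparison, an interquartile-range rule could be sketched as follows. This is an alternative technique rather than the method used in this analysis; `iqr_outliers` is a hypothetical helper and the `red_cards` values are illustrative, not the scraped data:

```python
import pandas as pd

def iqr_outliers(column, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = column.quantile(0.25), column.quantile(0.75)
    iqr = q3 - q1
    mask = (column < q1 - k * iqr) | (column > q3 + k * iqr)
    return column[mask]

# Illustrative column with one unusually large value
red_cards = pd.Series([0, 1, 1, 2, 2, 2, 3, 3, 9])
flagged = iqr_outliers(red_cards)
print(flagged.tolist())  # [9]
```

With the conventional multiplier of 1.5, the IQR rule usually flags more points than a z-score threshold of 3, so the two approaches are not interchangeable.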
Findings of data quality assessment:
The squad column has a prefix of "vs" in it to indicate that this is data for opponent's performance against that team. Removing the prefix:
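# Remove only the literal leading "vs " prefix (str.strip would also trim trailing 'v' or 's' characters)
standard_stats_against_data['squad'] = standard_stats_against_data['squad'].str.replace(r'^vs ', '', regex = True)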
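When removing a prefix like this, it helps to remember that `str.strip` treats its argument as a set of characters to trim from both ends, not as a literal prefix. A small sketch of the difference, using a few squad names as sample input:

```python
import pandas as pd

squads = pd.Series(["vs Arsenal", "vs Wolves", "vs Leeds United"])

# str.strip removes any of the characters 'v', 's' and ' ' from BOTH ends
stripped = squads.str.strip("vs ")

# An anchored regular expression removes only the literal leading prefix
prefix_removed = squads.str.replace(r"^vs ", "", regex=True)

print(stripped.tolist())        # ['Arsenal', 'Wolve', 'Leeds United']
print(prefix_removed.tolist())  # ['Arsenal', 'Wolves', 'Leeds United']
```

Any squad name ending in "s" or "v" is affected by the character-set behaviour, which matters later because the squad name is used as a merge key.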
Now, adding the xG against and non-penalty xG against data back to the original defensive statistics data frame. Then re-create the smaller data frame that excludes the ongoing 2022-2023 season:
defensive_actions_data = pd.merge(defensive_actions_data, standard_stats_against_data.loc[:, ['season_name', 'squad', 'xg_against', 'non_penalty_xg_against']], on = ['season_name', 'squad'])
def_actions_past_5 = defensive_actions_data.query(" season_name != '2022-2023'")
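Because `pd.merge` performs an inner join by default, any squad whose name differs between the two frames is dropped without warning. One way to check for this is sketched below, using the `validate` and `indicator` options of `pd.merge`. The `left` and `right` frames are made-up examples with a deliberately mismatched name:

```python
import pandas as pd

left = pd.DataFrame({
    "season_name": ["2021-2022", "2021-2022"],
    "squad": ["Arsenal", "Wolves"],
    "tackles_in_attacking_3rd": [40, 30],
})
right = pd.DataFrame({
    "season_name": ["2021-2022", "2021-2022"],
    "squad": ["Arsenal", "Wolve"],  # deliberately mismatched name
    "xg_against": [45.0, 50.0],
})

# how="left" keeps every left row; validate raises if the keys are not unique;
# indicator adds a _merge column saying where each row was found
checked = pd.merge(
    left, right, on=["season_name", "squad"],
    how="left", validate="one_to_one", indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "squad"].tolist()
print(unmatched)  # ['Wolves']
```

An empty `unmatched` list on the real frames would indicate that every team-season found its xG data.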
Now plot the data:
def_plot_4 = sns.lmplot(data = def_actions_past_5, x = 'tackles_in_attacking_3rd', y = 'non_penalty_xg_against', ci = None,
scatter_kws= {'s': 50}, height = 6, aspect = 1.25)
plt.title('Tackles in final 3rd vs Non-Penalty xG conceded', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Non-Penalty xG conceded", fontsize = 15, labelpad = 15)
plt.xlabel("Tackles in final 3rd", fontsize = 15, labelpad = 15)
def_plot_4.set(ylim = (10, 80))
def_plot_4.set(xlim = (20, 120))
plt.show()
The graph does show an inverse or negative relationship between the two metrics, indicating that teams with a high number of final third tackles could be conceding less xG. However, the relationship seems a weak one. This makes sense because defensive performance depends on a variety of factors, including the team's defensive structure, effective set piece defending, avoiding individual mistakes and managing the game when leading or trailing by a goal. Pressing high and pressing effectively can be an effective method to stop attacks at their root, but it does not seem to directly correlate with great defensive performance through a reduction in xG conceded.
Teams press high with two sometimes mutually exclusive, sometimes related aims:
First, merging the data frame that contains metrics for shot creating actions, into the defence data frame. Then re-creating the smaller data frame that excludes the ongoing 2022-2023 season:
defensive_actions_data = pd.merge(defensive_actions_data, shooting_merged_data.loc[:,['squad', 'season_name', 'sca']], on = ['squad', 'season_name'])
def_actions_past_5 = defensive_actions_data.query(" season_name != '2022-2023'")
# Creating the plot:
def_plot_5 = sns.lmplot(data = def_actions_past_5 ,x = 'pct_final_3rd_tackles', y = 'sca', ci = None,
scatter_kws= {'s': 50}, height = 6, aspect = 1.25)
plt.title('Percent of Tackles in final 3rd vs Shot Creating Actions', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Shot Creating actions", fontsize = 15, labelpad = 15)
plt.xlabel(" % of Tackles in final 3rd", fontsize = 15, labelpad = 15)
def_plot_5.set(xlim = (4, 24))
plt.show()
The graph shows a positive and relatively strong relationship between the percentage of tackles made in the final third and shot creating actions.
Check correlation between them to get a clearer sense of the relationship:
print(f"The correlation between Tackles in final 3rd and Shot creating actions is: {def_actions_past_5['pct_final_3rd_tackles'].corr(def_actions_past_5['sca'])} ")
The correlation between Tackles in final 3rd and Shot creating actions is: 0.6400902647412364
The correlation of roughly 0.64 confirms a fairly strong positive relationship. This shows that teams who press high effectively also tend to create a lot of chances. The usual caveat applies again: correlation is not causation. The graph and the correlation value only show that these metrics are related. The relationship could also be strengthened considerably by the top six or seven teams in each season, who both press effectively in the final third and create a lot of chances (shots).
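One way to probe the caveat about the top teams is to compute the correlation twice, once for all teams and once with the strongest teams left out. The sketch below uses made-up numbers and a hypothetical `top_team` flag (neither comes from the scraped data), so it only illustrates the mechanics, not a result.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the scraped Fbref tables.
toy = pd.DataFrame({
    "squad": ["A", "B", "C", "D", "E", "F"],
    "pct_final_3rd_tackles": [18.0, 16.5, 12.0, 11.0, 10.5, 9.0],
    "sca": [950, 900, 620, 640, 600, 610],
    "top_team": [True, True, False, False, False, False],  # assumed label
})

def pearson(df):
    """Pearson correlation between final-third tackle share and SCA."""
    return df["pct_final_3rd_tackles"].corr(df["sca"])

overall = pearson(toy)
without_top = pearson(toy[~toy["top_team"]])

print(f"All teams: {overall:.2f}")
print(f"Excluding top teams: {without_top:.2f}")
```

If the second value is much lower than the first, the headline correlation is being driven largely by the gap between the top teams and everyone else.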
i. Are they making more passes?
ii. Are they making more progressive passes?
This question, together with the previous question about passes in general, relates to the hypothesis that defenders are increasingly given more responsibility to progress the ball up the pitch to the attackers.
iii. Are they making more of their tackles in the middle or defensive third?
The hypothesis here is that higher defensive lines and pressing have led to defenders stepping out more to commit tackles and attempting to win the ball higher up the pitch than they traditionally did.
First, the data for individual passing and defensive performance of all players in the past five seasons needs to be scraped from Fbref and stored in pandas data frames.
Here is a brief description of the process:
# Get the passing data for all players:
def get_passing_data_players(urls):
    passing_player_data = pd.DataFrame()
    passing_player_cols = ['ranking','player','nation','position','squad','age','born', '90s_played','passes_completed', 'passes_attempted',
                           'passes_completed_percent', 'passes_distance_travelled', 'passes_progressive_distance_travelled',
                           'short_passes_completed', 'short_passes_attempted','short_passes_completed_percent',
                           'medium_passes_completed', 'medium_passes_attempted','medium_passes_completed_percent',
                           'long_passes_completed', 'long_passes_attempted','long_passes_completed_percent',
                           'assists', 'expected_assisted_goals', 'xa', 'assists_minus_expected_assisted_goals',
                           'key_passes', 'final_third_entering_passes', 'passes_into_18_yard',
                           'crosses_into_18_yard', 'progressive_passes', 'season_name', 'name']
    browser = None
    try:
        # Iterate over the dictionary passed in as `urls`, not the global variable
        for season_name, season_url in urls.items():
            browser = webdriver.Chrome()
            browser.get(season_url)
            html_source = browser.page_source
            soup = bs.BeautifulSoup(html_source, "lxml")
            table = str(soup.find_all("table", {"id": "stats_passing"}))
            table = pd.read_html(table)
            passing_players = pd.DataFrame(table[0])
            # Flatten the two-level header and drop the 'Matches' link column
            passing_players = passing_players.droplevel(level = 0, axis = 1)
            passing_players.drop('Matches', axis = 1, inplace= True)
            passing_players['season_name'] = season_name
            passing_players['name'] = 'passing_player_data'
            # Remove the header rows that are repeated inside the table body
            passing_players = passing_players.query("Rk != 'Rk'")
            passing_player_data = pd.concat([passing_player_data, passing_players], ignore_index = True)
            # Close this season's browser and pause before the next request
            browser.quit()
            browser = None
            time.sleep(20)
    except HTTPError as error:
        print(error)
        print("The code could not be executed due to a HTTP error!")
    except ValueError:
        print("The table could not be found in the URL provided! Check the table Title")
    except requests.exceptions.ConnectionError as e:
        print("Connection could not be established! Check the URL!")
    except IndexError:
        print("There was a list index out of range error!")
    finally:
        # Close the browser only if an exception left one open
        if browser is not None:
            browser.quit()
    passing_player_data.columns = passing_player_cols
    return passing_player_data
urls_passing = {'2022-2023': 'https://fbref.com/en/comps/9/passing/Premier-League-Stats#all_stats_passing',
'2021-2022': 'https://fbref.com/en/comps/9/2021-2022/passing/2021-2022-Premier-League-Stats',
'2020-2021': 'https://fbref.com/en/comps/9/2020-2021/passing/2020-2021-Premier-League-Stats',
'2019-2020': 'https://fbref.com/en/comps/9/2019-2020/passing/2019-2020-Premier-League-Stats',
'2018-2019': 'https://fbref.com/en/comps/9/2018-2019/passing/2018-2019-Premier-League-Stats',
'2017-2018': 'https://fbref.com/en/comps/9/2017-2018/passing/2017-2018-Premier-League-Stats'}
passing_player_data = get_passing_data_players(urls_passing)
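The table-cleaning steps inside the scraping function (flattening the two-level header, dropping the 'Matches' column and removing repeated header rows) can be tried out without opening a browser. The following is a minimal sketch on a hand-built frame; the column labels and values are hypothetical stand-ins for what `pd.read_html` returns, not real Fbref output.

```python
import pandas as pd

# Hypothetical miniature of a two-level table such as pd.read_html might return.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed", "Rk"), ("Unnamed", "Player"),
    ("Total", "Cmp"), ("Total", "Att"), ("Unnamed", "Matches"),
])
raw = pd.DataFrame(
    [
        ["1", "Player A", "334", "446", "Matches"],
        ["Rk", "Player", "Cmp", "Att", "Matches"],  # header row repeated in the body
        ["2", "Player B", "178", "257", "Matches"],
    ],
    columns=columns,
)

flat = raw.droplevel(level=0, axis=1)   # keep only the lower header level
flat = flat.drop(columns="Matches")     # link column, not a statistic
flat = flat[flat["Rk"] != "Rk"].reset_index(drop=True)  # drop repeated headers

print(flat)
```

Testing the cleaning logic on a small frame like this keeps it separate from the slow and fragile browser step.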
Running unit tests on the above data:
class TestScraping(unittest.TestCase):
    # Test whether the function exists:
    def test_fun_exists(self):
        self.assertIsNotNone(get_passing_data_players)
        print("The function exists!")

    # Test whether the data frame is not None, so something has been returned
    def test_data_frame_exists(self):
        self.assertIsNotNone(passing_player_data)
        print("The data frame exists!")

    # Test whether the data frame has at least a minimum number of rows:
    def test_data_frame_rows_length(self):
        self.assertGreaterEqual(len(passing_player_data), 100)
        print("The data frame has some data, the scraping is successful!")

    # Test whether the data frame has at least a minimum number of columns:
    def test_data_frame_columns_length(self):
        self.assertGreaterEqual(len(passing_player_data.columns), 10)
        print("The data frame has some columns, the scraping is successful!")

unittest.main(argv = ['ignored', '-v'], exit = False)
test_data_frame_columns_length (__main__.TestScraping) ... ok
test_data_frame_exists (__main__.TestScraping) ... ok
test_data_frame_rows_length (__main__.TestScraping) ... ok
test_fun_exists (__main__.TestScraping) ... ok
----------------------------------------------------------------------
Ran 4 tests in 0.002s
OK
The data frame has some columns, the scraping is successful!
The data frame exists!
The data frame has some data, the scraping is successful!
The function exists!
<unittest.main.TestProgram at 0x7ff378df1a30>
All the unit tests have passed and the scraping is successful.
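The tests above mainly confirm that something was returned. A slightly stricter set of checks could look like the sketch below, which verifies that required columns are present, that every requested season was scraped and that no repeated header rows survived. The `sample` frame, `EXPECTED_SEASONS` and the test class name are hypothetical stand-ins, not objects from this analysis.

```python
import unittest
import pandas as pd

# Hypothetical stand-in for a scraped data frame.
sample = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "squad": ["Team X", "Team Y"],
    "season_name": ["2021-2022", "2020-2021"],
    "passes_attempted": ["446", "257"],
})
EXPECTED_SEASONS = {"2021-2022", "2020-2021"}  # assumed keys of the URL dictionary

class TestScrapedFrame(unittest.TestCase):
    def test_required_columns_present(self):
        required = {"player", "squad", "season_name", "passes_attempted"}
        self.assertTrue(required.issubset(sample.columns))

    def test_every_season_scraped(self):
        self.assertEqual(set(sample["season_name"]), EXPECTED_SEASONS)

    def test_no_repeated_header_rows(self):
        # A repeated header row would carry the literal label 'Player'
        self.assertFalse((sample["player"] == "Player").any())

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestScrapedFrame)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Checks of this kind would fail if one season's page could not be loaded, which the row-count threshold alone might not reveal.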
Checking data types and missing values:
assess_data_frames(passing_player_data)
There are 3108 rows and 33 columns
These columns have missing values:
passes_completed 2
passes_attempted 2
passes_completed_percent 26
passes_distance_travelled 2
passes_progressive_distance_travelled 2
short_passes_completed 2
short_passes_attempted 2
short_passes_completed_percent 56
medium_passes_completed 2
medium_passes_attempted 2
medium_passes_completed_percent 78
long_passes_completed 2
long_passes_attempted 2
long_passes_completed_percent 196
expected_assisted_goals 2
xa 2
assists_minus_expected_assisted_goals 2
key_passes 2
final_third_entering_passes 2
passes_into_18_yard 2
crosses_into_18_yard 2
progressive_passes 2
dtype: int64
These are the data types of the columns:
ranking object
player object
nation object
position object
squad object
age object
born object
90s_played object
passes_completed object
passes_attempted object
passes_completed_percent object
passes_distance_travelled object
passes_progressive_distance_travelled object
short_passes_completed object
short_passes_attempted object
short_passes_completed_percent object
medium_passes_completed object
medium_passes_attempted object
medium_passes_completed_percent object
long_passes_completed object
long_passes_attempted object
long_passes_completed_percent object
assists object
expected_assisted_goals object
xa object
assists_minus_expected_assisted_goals object
key_passes object
final_third_entering_passes object
passes_into_18_yard object
crosses_into_18_yard object
progressive_passes object
season_name object
name object
dtype: object
Assessing outliers:
passing_player_data.apply(find_outliers_IQR, axis = 'rows')
ranking None
player None
nation None
position None
squad None
age None
born None
90s_played None
passes_completed None
passes_attempted None
passes_completed_percent None
passes_distance_travelled None
passes_progressive_distance_travelled None
short_passes_completed None
short_passes_attempted None
short_passes_completed_percent None
medium_passes_completed None
medium_passes_attempted None
medium_passes_completed_percent None
long_passes_completed None
long_passes_attempted None
long_passes_completed_percent None
assists None
expected_assisted_goals None
xa None
assists_minus_expected_assisted_goals None
key_passes None
final_third_entering_passes None
passes_into_18_yard None
crosses_into_18_yard None
progressive_passes None
season_name None
name None
dtype: object
There are no outliers in the data. However, every column is stored as a string (object), which is not appropriate, and there are some missing values.
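For reference, an IQR-based outlier check can be sketched as follows. This is not the `find_outliers_IQR` helper used above, whose definition is not shown in this section; `iqr_outliers` and the toy values are hypothetical. Because the rule compares numbers, the sketch converts strings to numeric values first. Since all columns are still stored as strings at this point, it may be worth repeating the outlier check after the type conversion.

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles.

    Hypothetical helper for illustration only.
    """
    # Strings such as "95" become numbers; anything unparseable becomes NaN
    numeric = pd.to_numeric(series, errors="coerce").dropna()
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return numeric[mask]

# Toy column stored as strings, as in the freshly scraped data frame
toy = pd.Series(["10", "12", "11", "13", "12", "95"])
found = iqr_outliers(toy)
print(found)
```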
Cleaning some of the observed issues in the data including:
# Some cleaning
passing_player_data['nation'] = passing_player_data['nation'].str.split(' ', expand = True)[1]
passing_player_data['age'] = passing_player_data['age'].str.split('-', expand = True)[0]
passing_player_data['season'] = passing_player_data['season_name'].str.split('-', expand = True)[0]
passing_player_data[['position', 'position_secondary']] = passing_player_data['position'].str.split(',', expand = True)
passing_player_data['position_secondary'] = passing_player_data['position_secondary'].fillna("None")
passing_player_data = passing_player_data.rename({'90s_played': 'nineties_played'}, axis = 1)
cols_object = ['player', 'nation', 'position','position_secondary' ,'squad','born' ,'season_name', 'name']
# Convert every column that is not in cols_object to float
for col in passing_player_data.columns:
    if col in cols_object:
        continue
    passing_player_data = passing_player_data.astype({col: 'float'})
passing_player_data.head(3)
| | ranking | player | nation | position | squad | age | born | nineties_played | passes_completed | passes_attempted | ... | assists_minus_expected_assisted_goals | key_passes | final_third_entering_passes | passes_into_18_yard | crosses_into_18_yard | progressive_passes | season_name | name | season | position_secondary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Brenden Aaronson | USA | MF | Leeds United | 22.0 | 2000 | 13.2 | 334.0 | 446.0 | ... | -0.9 | 27.0 | 27.0 | 7.0 | 2.0 | 28.0 | 2022-2023 | passing_player_data | 2022.0 | FW |
| 1 | 2.0 | Che Adams | SCO | FW | Southampton | 26.0 | 1996 | 11.8 | 178.0 | 257.0 | ... | -0.7 | 16.0 | 8.0 | 7.0 | 2.0 | 11.0 | 2022-2023 | passing_player_data | 2022.0 | None |
| 2 | 3.0 | Tyler Adams | USA | MF | Leeds United | 23.0 | 1999 | 13.0 | 619.0 | 752.0 | ... | -0.9 | 15.0 | 61.0 | 8.0 | 0.0 | 58.0 | 2022-2023 | passing_player_data | 2022.0 | None |
3 rows × 35 columns
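The column-by-column `astype` loop can also be written as a single vectorised conversion. The sketch below shows one possible alternative using `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN instead of raising an error. The toy frame and the `text_cols` list are hypothetical.

```python
import pandas as pd

# Toy frame with string-typed columns, mimicking freshly scraped data
toy = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "age": ["22-101", "26-045"],         # years-days, as in the raw age column
    "passes_attempted": ["446", None],
})

# Keep only the years part of the age
toy["age"] = toy["age"].str.split("-", expand=True)[0]

text_cols = ["player"]                    # columns that should stay as text
numeric_cols = toy.columns.difference(text_cols)
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```

Whether silent coercion to NaN is preferable to an explicit error depends on how much the raw table is trusted; the loop used above fails loudly on unexpected values, which can also be useful.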
Now, assessing the data frame for missing values using passes attempted and passes completed:
passing_player_data[passing_player_data['passes_attempted'].isnull() | passing_player_data['passes_completed'].isnull()][['player','nineties_played','passes_completed', 'passes_attempted']]
| | player | nineties_played | passes_completed | passes_attempted |
|---|---|---|---|---|
| 850 | Sonny Perkins | 0.0 | NaN | NaN |
| 2942 | Aiden O'Neill | 0.0 | NaN | NaN |
Assess missing values in other columns:
passing_player_data[passing_player_data['passes_completed_percent'].isnull() | passing_player_data['short_passes_completed_percent'].isnull()][['player','nineties_played','passes_completed_percent', 'short_passes_completed_percent']].head(10)
| | player | nineties_played | passes_completed_percent | short_passes_completed_percent |
|---|---|---|---|---|
| 61 | Cafú | 0.1 | 100.0 | NaN |
| 77 | Nathaniel Chalobah | 0.1 | 66.7 | NaN |
| 83 | Bobby Clark | 0.1 | 100.0 | NaN |
| 97 | Conor Coventry | 0.0 | NaN | NaN |
| 112 | Halil Dervişoğlu | 0.1 | NaN | NaN |
| 146 | Mateo Fernández | 0.0 | NaN | NaN |
| 170 | Ben Godfrey | 0.2 | 80.0 | NaN |
| 189 | Jan Paul van Hecke | 0.0 | 100.0 | NaN |
| 237 | Emil Krafth | 0.0 | NaN | NaN |
| 261 | Jamal Lowe | 0.0 | NaN | NaN |
The missing values belong to players who have played very little and therefore may not have been on the pitch long enough to attempt a pass. These players are filtered out by keeping only those who played more than three full games' worth of minutes (nineties_played > 3):
passing_player_data = passing_player_data.query(" nineties_played > 3 ")
Assessing data types and missing values now:
assess_data_frames(passing_player_data)
There are 2433 rows and 35 columns
There are no columns with missing values
ranking float64
player object
nation object
position object
squad object
age float64
born object
nineties_played float64
passes_completed float64
passes_attempted float64
passes_completed_percent float64
passes_distance_travelled float64
passes_progressive_distance_travelled float64
short_passes_completed float64
short_passes_attempted float64
short_passes_completed_percent float64
medium_passes_completed float64
medium_passes_attempted float64
medium_passes_completed_percent float64
long_passes_completed float64
long_passes_attempted float64
long_passes_completed_percent float64
assists float64
expected_assisted_goals float64
xa float64
assists_minus_expected_assisted_goals float64
key_passes float64
final_third_entering_passes float64
passes_into_18_yard float64
crosses_into_18_yard float64
progressive_passes float64
season_name object
name object
season float64
position_secondary object
dtype: object
There are no missing values, and the data types of the columns are now appropriate.
Web scraping can run into issues or, worse, fail outright if the source code of Fbref changes, especially for player performance data that is loaded dynamically, as mentioned.
The data is therefore stored on the local machine as a CSV file:
passing_player_data.to_csv('passing_player_data.csv')
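When the CSV file is read back later, the row index written by `to_csv` reappears as an extra unnamed column unless it is suppressed. The sketch below shows a round trip with `index=False`; it writes to an in-memory buffer instead of a file, and the toy frame is hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "passes_attempted": [446.0, 257.0],
})

buffer = io.StringIO()            # stands in for a file on disk
toy.to_csv(buffer, index=False)   # index=False avoids an 'Unnamed: 0' column on reload
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```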
Adding some necessary information from the goalkeeping data frame to the players' performance data, so that the player passing data also includes some characteristics of the team they played for in a particular season:
passing_player_data = pd.merge(passing_player_data, goalkeeping_merged_data.loc[:, ['squad', 'season_name', 'possession', 'high_possession_team', 'pct_final_3rd_tackles', 'high_press_team']], on = ['squad', 'season_name'])
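`pd.merge` performs an inner join by default, so any player whose squad and season have no match in the team-level frame is dropped without a warning. A minimal sketch of how this could be checked, using a left join with `indicator=True` and `validate="many_to_one"`, is shown below. The `players` and `teams` frames are hypothetical toy data.

```python
import pandas as pd

players = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "squad": ["Team X", "Team X", "Team Z"],
    "season_name": ["2021-2022"] * 3,
})
teams = pd.DataFrame({
    "squad": ["Team X", "Team Y"],
    "season_name": ["2021-2022"] * 2,
    "possession": [61.2, 44.8],
})

merged = pd.merge(
    players, teams,
    on=["squad", "season_name"],
    how="left",               # keep every player row
    validate="many_to_one",   # raise if a squad-season appears twice in `teams`
    indicator=True,           # adds a '_merge' column describing each row's match
)

# Rows that found no team-level match
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["player", "squad"]])
```

An empty `unmatched` frame would indicate that the squad names are spelled consistently across both sources.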
Now, filtering the data to keep only defenders who have played at least 15 full matches' worth of minutes in a season (nineties_played >= 15):
passing_player_data_defenders = passing_player_data[passing_player_data['position'].str.contains("DF")]
passing_player_data_defenders = passing_player_data_defenders.query("nineties_played >= 15")
passing_player_data_defenders = passing_player_data_defenders.reset_index()
i. Check if the defenders are making more passes:
plt.figure(figsize = (10,8))
passing_player_data_defenders = passing_player_data_defenders.sort_values(by = 'season', ascending = True)
def_plot_6 = sns.lineplot(data = passing_player_data_defenders.query("season_name != '2022-2023'"), x = 'season', y = 'passes_attempted', ci = None, linewidth = 5, estimator = np.mean)
plt.title('Mean of Passes Attempted by Defenders per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Passes Attempted", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
def_plot_6.set_xticks([2017,2018,2019,2020,2021])
plt.show()
The number of passes attempted by defenders increased gradually each season, peaking in 2020, before decreasing in the 2021 season. Defenders are increasingly being tasked with keeping the ball and progressing it forward. The stereotypical tall, strong defender with no ability on the ball is a thing of the past. Many defenders in the modern game have technical and passing quality that rivals that of midfielders, including players like Lisandro Martínez, Aymeric Laporte and Trent Alexander-Arnold.
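The line drawn by `sns.lineplot` with `estimator = np.mean` is the per-season mean, which can also be tabulated directly to read off exact values. The sketch below does this on a hypothetical toy frame and adds the number of players behind each point.

```python
import pandas as pd

# Toy data: hypothetical values, not the scraped defender statistics
toy = pd.DataFrame({
    "season": [2017, 2017, 2018, 2018, 2019],
    "passes_attempted": [1000.0, 1200.0, 1300.0, 1100.0, 1500.0],
})

# Per-season mean (the plotted value) and the number of players it is based on
season_means = toy.groupby("season")["passes_attempted"].agg(["mean", "count"])
print(season_means)
```

Printing the counts alongside the means makes it easier to judge whether a change between seasons rests on a comparable number of players.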
ii. On to the next question: what are the trends in progressive passes and passes entering the final third by defenders?
Defenders are making (attempting) more passes as seasons progress. But are they being tasked with progressing the ball forward to dangerous areas?
f, ax = plt.subplots(1,2, figsize = (18,10))
def_plot_7 = sns.lineplot(data = passing_player_data_defenders.query("season_name != '2022-2023'"), x = 'season', y = 'progressive_passes', hue = 'high_possession_team', ci = None, ax = ax[0], estimator = np.mean, linewidth = 5, palette = palette_possession)
def_plot_8 = sns.lineplot(data = passing_player_data_defenders.query("season_name != '2022-2023'"), x = 'season', y = 'final_third_entering_passes', hue = 'high_possession_team', ci = None, ax = ax[1], estimator = np.mean, linewidth = 5, palette = palette_possession)
def_plot_7.set(xticks = [2017, 2018, 2019, 2020, 2021])
def_plot_7.set_title('Mean Progressive passes played by Defenders per season:',fontsize = 17, loc = 'center',fontweight = 'bold',pad = 15)
def_plot_7.set_xlabel('Season', fontsize = 15, labelpad = 15)
def_plot_7.set_ylabel('Progressive Passes by defenders', fontsize = 15, labelpad = 10)
def_plot_7.legend(title = 'Possession Category:')
def_plot_8.set(xticks = [2017, 2018, 2019, 2020, 2021])
def_plot_8.set_title('Mean Passes Entering Final 3rd made by defenders per season:',fontsize = 17, loc = 'center',fontweight = 'bold',pad = 15)
def_plot_8.set_xlabel('Season', fontsize = 15, labelpad = 15)
def_plot_8.set_ylabel('Passes Entering Final 3rd made by defenders', fontsize = 15, labelpad = 10)
def_plot_8.legend(title = 'Possession Category:')
plt.show()
First, assessing the progressive passes made by defenders:
Next, assessing the final third entering passes made by defenders:
iii. Next question, are defenders making more of their tackles in the middle or defensive third?
First, the individual player performance data for defenders needs to be scraped from Fbref. The same process as previously described is followed, with the Selenium web driver used again.
# Get the players' data for defending performance:
def get_defending_player_data(urls):
    defending_player_data = pd.DataFrame()
    defending_player_cols = ['ranking','player','nation','position','squad','age','born', '90s_played', 'tackles_made', 'tackles_won',
                             'tackles_in_defensive_3rd', 'tackles_in_middle_3rd', 'tackles_in_attacking_3rd',
                             'tackles_won_vs_dribbles', 'total_tackles_vs_dribblers_includes_lost_plus_won',
                             'tackles_vs_dribblers_success_percent', 'tackles_lost_vs_dribblers',
                             'blocks', 'shots_blocked', 'passes_blocked', 'interceptions',
                             'tackles_plus_interceptions', 'clearances', 'errors_leading_to_goal', 'season_name', 'name']
    browser = None
    try:
        # Iterate over the dictionary passed in as `urls`, not the global variable
        for season_name, season_url in urls.items():
            browser = webdriver.Chrome()
            browser.get(season_url)
            html_source = browser.page_source
            soup = bs.BeautifulSoup(html_source, "lxml")
            table = str(soup.find_all("table", {"id": "stats_defense"}))
            table = pd.read_html(table)
            defense = pd.DataFrame(table[0])
            # Flatten the two-level header and drop the 'Matches' link column
            defense = defense.droplevel(level = 0, axis = 1)
            defense.drop('Matches', axis = 1, inplace= True)
            defense['season_name'] = season_name
            defense['name'] = 'defending_player_data'
            # Remove the header rows that are repeated inside the table body
            defense = defense.query("Rk != 'Rk'")
            defending_player_data = pd.concat([defending_player_data, defense], ignore_index = True)
            # Close this season's browser and pause before the next request
            browser.quit()
            browser = None
            time.sleep(20)
    except HTTPError as error:
        print(error)
        print("The code could not be executed due to a HTTP error!")
    except ValueError:
        print("The table could not be found in the URL provided! Check the table Title")
    except requests.exceptions.ConnectionError as e:
        print("Connection could not be established! Check the URL!")
    except IndexError:
        print("There was a list index out of range error!")
    finally:
        # Close the browser only if an exception left one open
        if browser is not None:
            browser.quit()
    defending_player_data.columns = defending_player_cols
    return defending_player_data
#define seasons URLs
urls_defending = {'2022-2023': 'https://fbref.com/en/comps/9/defense/Premier-League-Stats#all_stats_defense',
'2021-2022': 'https://fbref.com/en/comps/9/2021-2022/defense/2021-2022-Premier-League-Stats',
'2020-2021': 'https://fbref.com/en/comps/9/2020-2021/defense/2020-2021-Premier-League-Stats',
'2019-2020': 'https://fbref.com/en/comps/9/2019-2020/defense/2019-2020-Premier-League-Stats',
'2018-2019': 'https://fbref.com/en/comps/9/2018-2019/defense/2018-2019-Premier-League-Stats',
'2017-2018': 'https://fbref.com/en/comps/9/2017-2018/defense/2017-2018-Premier-League-Stats'}
defending_player_data = get_defending_player_data(urls_defending)
Running unit tests on the above function and data frame:
class TestScraping(unittest.TestCase):
    # Test whether the function exists:
    def test_fun_exists(self):
        self.assertIsNotNone(get_defending_player_data)
        print("The function exists!")

    # Test whether the data frame is not None, so something has been returned
    def test_data_frame_exists(self):
        self.assertIsNotNone(defending_player_data)
        print("The data frame exists!")

    # Test whether the data frame has at least a minimum number of rows:
    def test_data_frame_rows_length(self):
        self.assertGreaterEqual(len(defending_player_data), 100)
        print("The data frame has some data, the scraping is successful!")

    # Test whether the data frame has at least a minimum number of columns:
    def test_data_frame_columns_length(self):
        self.assertGreaterEqual(len(defending_player_data.columns), 10)
        print("The data frame has some columns, the scraping is successful!")

unittest.main(argv = ['ignored', '-v'], exit = False)
test_data_frame_columns_length (__main__.TestScraping) ... ok
test_data_frame_exists (__main__.TestScraping) ... ok
test_data_frame_rows_length (__main__.TestScraping) ... ok
test_fun_exists (__main__.TestScraping) ... ok
----------------------------------------------------------------------
Ran 4 tests in 0.005s
OK
The data frame has some columns, the scraping is successful!
The data frame exists!
The data frame has some data, the scraping is successful!
The function exists!
<unittest.main.TestProgram at 0x7ff38bfa44f0>
All the unit tests have passed and the scraping is successful.
Checking data types:
assess_data_frames(defending_player_data)
There are 3108 rows and 26 columns
These columns have missing values:
tackles_made 2
tackles_won 1
tackles_in_defensive_3rd 2
tackles_in_middle_3rd 2
tackles_in_attacking_3rd 2
tackles_won_vs_dribbles 2
total_tackles_vs_dribblers_includes_lost_plus_won 2
tackles_vs_dribblers_success_percent 364
tackles_lost_vs_dribblers 2
blocks 2
shots_blocked 2
passes_blocked 2
interceptions 1
tackles_plus_interceptions 2
clearances 2
errors_leading_to_goal 2
dtype: int64
These are the data types of the columns:
ranking object
player object
nation object
position object
squad object
age object
born object
90s_played object
tackles_made object
tackles_won object
tackles_in_defensive_3rd object
tackles_in_middle_3rd object
tackles_in_attacking_3rd object
tackles_won_vs_dribbles object
total_tackles_vs_dribblers_includes_lost_plus_won object
tackles_vs_dribblers_success_percent object
tackles_lost_vs_dribblers object
blocks object
shots_blocked object
passes_blocked object
interceptions object
tackles_plus_interceptions object
clearances object
errors_leading_to_goal object
season_name object
name object
dtype: object
The column 'tackles_vs_dribblers_success_percent' has a lot of missing values. Some other columns also have a small number of missing values, but these could be for players who have not played a lot of minutes.
All the columns are also stored as strings.
Assess outliers:
defending_player_data.apply(find_outliers_IQR, axis = 'rows')
ranking None
player None
nation None
position None
squad None
age None
born None
90s_played None
tackles_made None
tackles_won None
tackles_in_defensive_3rd None
tackles_in_middle_3rd None
tackles_in_attacking_3rd None
tackles_won_vs_dribbles None
total_tackles_vs_dribblers_includes_lost_plus_won None
tackles_vs_dribblers_success_percent None
tackles_lost_vs_dribblers None
blocks None
shots_blocked None
passes_blocked None
interceptions None
tackles_plus_interceptions None
clearances None
errors_leading_to_goal None
season_name None
name None
dtype: object
There are no outliers.
Doing some initial cleaning of the data including:
# Some cleaning
defending_player_data['nation'] = defending_player_data['nation'].str.split(' ', expand = True)[1]
defending_player_data['age'] = defending_player_data['age'].str.split('-', expand = True)[0]
defending_player_data['season'] = defending_player_data['season_name'].str.split('-', expand = True)[0]
defending_player_data[['position', 'position_secondary']] = defending_player_data['position'].str.split(',', expand = True)
defending_player_data['position_secondary'] = defending_player_data['position_secondary'].fillna("None")
defending_player_data = defending_player_data.rename({'90s_played': 'nineties_played'}, axis = 1)
cols_object = ['player', 'nation', 'position','position_secondary' ,'squad','born' ,'season_name', 'name']
# Convert every column that is not in cols_object to float
for col in defending_player_data.columns:
    if col in cols_object:
        continue
    defending_player_data = defending_player_data.astype({col: 'float'})
defending_player_data.head(3)
| | ranking | player | nation | position | squad | age | born | nineties_played | tackles_made | tackles_won | ... | shots_blocked | passes_blocked | interceptions | tackles_plus_interceptions | clearances | errors_leading_to_goal | season_name | name | season | position_secondary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Brenden Aaronson | USA | MF | Leeds United | 22.0 | 2000 | 13.2 | 24.0 | 6.0 | ... | 1.0 | 20.0 | 1.0 | 25.0 | 5.0 | 0.0 | 2022-2023 | defending_player_data | 2022.0 | FW |
| 1 | 2.0 | Che Adams | SCO | FW | Southampton | 26.0 | 1996 | 11.8 | 10.0 | 9.0 | ... | 2.0 | 6.0 | 3.0 | 13.0 | 14.0 | 0.0 | 2022-2023 | defending_player_data | 2022.0 | None |
| 2 | 3.0 | Tyler Adams | USA | MF | Leeds United | 23.0 | 1999 | 13.0 | 52.0 | 28.0 | ... | 4.0 | 17.0 | 16.0 | 68.0 | 13.0 | 0.0 | 2022-2023 | defending_player_data | 2022.0 | None |
3 rows × 28 columns
Assess the missing values:
defending_player_data[defending_player_data['tackles_vs_dribblers_success_percent'].isnull()][['player','nineties_played' ,'tackles_won_vs_dribbles','tackles_vs_dribblers_success_percent']].head(10)
| | player | nineties_played | tackles_won_vs_dribbles | tackles_vs_dribblers_success_percent |
|---|---|---|---|---|
| 12 | Alisson | 14.0 | 0.0 | NaN |
| 13 | Dele Alli | 0.4 | 0.0 | NaN |
| 22 | Cameron Archer | 0.5 | 0.0 | NaN |
| 23 | Alphonse Areola | 1.2 | 0.0 | NaN |
| 29 | Ludwig Augustinsson | 0.6 | 0.0 | NaN |
| 36 | Stefan Bajcetic | 0.2 | 0.0 | NaN |
| 41 | Mads Bech Sørensen | 0.2 | 0.0 | NaN |
| 42 | Jan Bednarek | 1.8 | 0.0 | NaN |
| 45 | Asmir Begović | 1.0 | 0.0 | NaN |
| 49 | Owen Bevan | 0.1 | 0.0 | NaN |
The missing values in the tackles_vs_dribblers_success_percent column belong to goalkeepers and outfield players who did not attempt a tackle against a dribbling opponent, so the success rate is undefined. These rows are filled with 0 here, treating zero attempted tackles against a dribbler as a zero percent success rate.
defending_player_data['tackles_vs_dribblers_success_percent'] = defending_player_data['tackles_vs_dribblers_success_percent'].fillna(0)
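The reason these percentages are missing can be reproduced in a couple of lines: a success rate computed from zero attempts is zero divided by zero, which pandas represents as NaN. The sketch below uses a hypothetical toy frame with shortened column names and also keeps a `no_attempts` flag, so that a filled-in 0 can still be told apart from a genuine 0 percent success rate.

```python
import numpy as np
import pandas as pd

# Toy data: Player A never attempted a tackle against a dribbler
toy = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "tackles_won_vs_dribbles": [0.0, 3.0],
    "total_tackles_vs_dribblers": [0.0, 6.0],
})

# 0 / 0 yields NaN, which is how the missing percentages arise
toy["success_percent"] = (
    100 * toy["tackles_won_vs_dribbles"] / toy["total_tackles_vs_dribblers"]
)

# Flag the undefined cases before filling them with 0
toy["no_attempts"] = toy["total_tackles_vs_dribblers"] == 0
toy["success_percent_filled"] = toy["success_percent"].fillna(0)

print(toy)
```

Filling with 0 is a modelling choice; averages of the percentage column would be pulled down by players who simply never faced a dribbler.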
Checking the missing values in the column now:
defending_player_data[defending_player_data['tackles_vs_dribblers_success_percent'].isnull()][['player','nineties_played' ,'tackles_won_vs_dribbles','tackles_vs_dribblers_success_percent']].head(10)
| player | nineties_played | tackles_won_vs_dribbles | tackles_vs_dribblers_success_percent |
|---|---|---|---|
Assessing the rest of the missing values:
defending_player_data[defending_player_data['tackles_in_defensive_3rd'].isnull() | defending_player_data['tackles_in_middle_3rd'].isnull()][['player','nineties_played' ,'tackles_in_middle_3rd','tackles_in_attacking_3rd', 'clearances', 'errors_leading_to_goal']].head(10)
| | player | nineties_played | tackles_in_middle_3rd | tackles_in_attacking_3rd | clearances | errors_leading_to_goal |
|---|---|---|---|---|---|---|
| 850 | Sonny Perkins | 0.0 | NaN | NaN | NaN | NaN |
| 2942 | Aiden O'Neill | 0.0 | NaN | NaN | NaN | NaN |
Keeping only players who played more than 2 full games' worth of minutes (nineties_played > 2):
defending_player_data = defending_player_data.query("nineties_played > 2 ")
Assessing missing values now:
assess_data_frames(defending_player_data)
There are 2567 rows and 28 columns
There are no columns with missing values
ranking float64
player object
nation object
position object
squad object
age float64
born object
nineties_played float64
tackles_made float64
tackles_won float64
tackles_in_defensive_3rd float64
tackles_in_middle_3rd float64
tackles_in_attacking_3rd float64
tackles_won_vs_dribbles float64
total_tackles_vs_dribblers_includes_lost_plus_won float64
tackles_vs_dribblers_success_percent float64
tackles_lost_vs_dribblers float64
blocks float64
shots_blocked float64
passes_blocked float64
interceptions float64
tackles_plus_interceptions float64
clearances float64
errors_leading_to_goal float64
season_name object
name object
season float64
position_secondary object
dtype: object
There are no missing values now.
Write the newly scraped data to a CSV File:
defending_player_data.to_csv('defending_player_data.csv')
Add the necessary information to the dataset from the goalkeeping data frame. Then filter the data frame to include only defenders who have played more than 15 full matches' worth of minutes in the season (nineties_played > 15). The position column contains the position a player plays in, and 'DF' indicates a defender.
defending_player_data = pd.merge(defending_player_data, goalkeeping_merged_data.loc[:, ['squad', 'season_name', 'possession', 'high_possession_team', 'pct_final_3rd_tackles', 'high_press_team']], on = ['squad', 'season_name'])
defending_player_data_defenders = defending_player_data[defending_player_data['position'].str.contains("DF")]
defending_player_data_defenders = defending_player_data_defenders.query("nineties_played > 15")
defending_player_data_defenders = defending_player_data_defenders.reset_index()
Now plot tackles plus interceptions by defenders:
plt.figure(figsize = (10,8))
def_plot_9 = sns.lineplot(data = defending_player_data_defenders, x = 'season', y = 'tackles_plus_interceptions', ci = None, estimator = np.mean, linewidth = 5)
def_plot_9.set_xticks([2017, 2018, 2019, 2020, 2021])
plt.title('Mean Tackles + Interceptions by Defenders per season',fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Tackles + Interceptions by Defenders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.show()
The number of tackles plus interceptions by defenders has decreased gradually since 2018.
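Because the overall number of defensive actions is falling, raw tackle counts in each third will tend to fall with it. One way to read "more of their tackles" is as a share of all tackles, which could be computed as in the sketch below. The toy values are hypothetical, and only the three tackle column names are taken from the data frame above.

```python
import pandas as pd

# Toy data: hypothetical per-player tackle counts for two seasons
toy = pd.DataFrame({
    "season": [2017, 2017, 2018, 2018],
    "tackles_in_defensive_3rd": [30.0, 20.0, 20.0, 20.0],
    "tackles_in_middle_3rd": [15.0, 25.0, 25.0, 25.0],
    "tackles_in_attacking_3rd": [5.0, 5.0, 5.0, 5.0],
})

third_cols = [
    "tackles_in_defensive_3rd",
    "tackles_in_middle_3rd",
    "tackles_in_attacking_3rd",
]

# Sum tackles per season, then express each third as a percentage of the total
totals = toy.groupby("season")[third_cols].sum()
shares = totals.div(totals.sum(axis=1), axis=0) * 100

print(shares.round(1))
```

A rising middle-third share alongside a falling defensive-third share would support the hypothesis even when the raw counts in every third are decreasing.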
Now check if Defenders are making more tackles in the middle 3rd:
plt.figure(figsize = (10,8))
def_plot_10 = sns.lineplot(data = defending_player_data_defenders, x = 'season', y = 'tackles_in_middle_3rd' ,ci = None, estimator = np.mean, linewidth = 5)
def_plot_10.set_xticks([2017, 2018, 2019, 2020, 2021])
plt.title('Mean tackles in middle third by Defenders per season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Tackles in middle 3rd by Defenders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.show()
The number of tackles in the middle third by defenders fluctuates too much from season to season to show a clear trend. The next plot checks whether the number of tackles made by defenders in the defensive third has changed:
plt.figure(figsize = (10,8))
def_plot_11 = sns.lineplot(data = defending_player_data_defenders, x = 'season', y = 'tackles_in_defensive_3rd', ci = None, estimator = np.mean, linewidth = 5)
def_plot_11.set_xticks([2017, 2018, 2019, 2020, 2021])
plt.title('Mean Tackles in defensive 3rd by Defenders per season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Tackles in defensive 3rd by defenders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.show()
After a rise in the 2018 season, the number of tackles made by defenders in the defensive third has decreased. This could be due to:
Assessing the questions about the evolving role of defenders:
plt.figure(figsize = (10,8))
pass_plot_1 = sns.lineplot(data = passing_merged_data.query("season_name != '2022-2023'"), x = 'season', y = 'through_balls', estimator = np.mean, ci = None, linewidth = 5)
pass_plot_1.set_xticks([2017, 2018, 2019, 2020, 2021])
plt.title('Mean Through Balls per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Number of Through Balls", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.show()
The number of through balls played per season decreased over several seasons, before increasing sharply in the most recent season plotted. Liam Tharme of The Athletic has written an excellent article [12] detailing through balls and possible reasons for their decline, so this paper will not explore the topic in depth.
i. Are they making more passes?
ii. Are they making more Progressive Passes?
This question, along with the first, relates to assessing the role of midfielders in building up attacks and progressing the ball forward.
iii. Are they making more or fewer tackles and interceptions?
This question relates to assessing the role of midfielders in defence and how their defensive responsibilities have changed.
First, make a data frame for midfielders and keep only midfielders who have played more than 10 full games in the season. The position column contains the position a player plays in and 'MF' indicates a midfielder.
passing_player_data_midfielders = passing_player_data[passing_player_data['position'].str.contains("MF")]
passing_player_data_midfielders = passing_player_data_midfielders.rename({'90s_played': 'nineties_played'}, axis = 1)
passing_player_data_midfielders = passing_player_data_midfielders.query("nineties_played > 10")
passing_player_data_midfielders = passing_player_data_midfielders.reset_index()
passing_player_data_midfielders = passing_player_data_midfielders.sort_values(by = 'season', ascending = True)
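The same position-and-minutes filter is applied again later for forwards, so it could be wrapped in a small helper. The sketch below is only an illustration under stated assumptions: `filter_by_position` and the toy rows are hypothetical, and only the column names `position` and `90s_played` are taken from the data above. Unlike the code above, it drops the old index when resetting it.

```python
import pandas as pd


def filter_by_position(player_data, position_code, min_nineties=10):
    """Keep players whose position contains position_code and who played
    more than min_nineties full games (hypothetical helper)."""
    mask = (player_data['position'].str.contains(position_code)
            & (player_data['90s_played'] > min_nineties))
    return player_data[mask].reset_index(drop=True)


# Toy rows for illustration only (not real player data)
toy_players = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D'],
    'position': ['MF', 'MF,FW', 'DF', 'MF'],
    '90s_played': [25.3, 12.0, 30.1, 4.5],
})

toy_midfielders = filter_by_position(toy_players, 'MF')
print(toy_midfielders)
```

Note that a player listed as 'MF,FW' passes both an 'MF' and an 'FW' filter, which is also how `str.contains` behaves in the cells above.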
i. Now plot the number of passes attempted by midfielders per season:
plt.figure(figsize = (10,8))
pass_plot_2 = sns.lineplot(data = passing_player_data_midfielders.query("season_name != '2022-2023'"), x = 'season', y = 'passes_attempted', ci = None, estimator = np.mean, linewidth = 5)
plt.title('Average Passes Attempted by Midfielders per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Average Passes Attempted by Midfielders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
pass_plot_2.set_xticks([2017, 2018, 2019, 2020, 2021])
pass_plot_2.set(ylim = (1040, 1200))
plt.show()
The graph shows that the average number of passes made by midfielders has fluctuated every season.
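To put numbers on a fluctuation like this, the per-season means that the line plot draws can be computed directly and differenced. This is a minimal sketch on made-up values; `toy_passes` is hypothetical and does not reproduce the real passing data.

```python
import pandas as pd

# Toy values for illustration only (not the real passing data)
toy_passes = pd.DataFrame({
    'season': [2017, 2017, 2018, 2018, 2019, 2019],
    'passes_attempted': [1000, 1100, 1150, 1250, 1050, 1150],
})

# Mean per season: the quantity a line plot with estimator = np.mean draws
season_means = toy_passes.groupby('season')['passes_attempted'].mean()

# Change from one season to the next, in absolute and percentage terms
season_change = season_means.diff()
season_pct_change = season_means.pct_change() * 100

print(pd.DataFrame({'mean': season_means,
                    'change': season_change,
                    'pct_change': season_pct_change.round(1)}))
```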
ii. Now assess the number of progressive passes made by midfielders:
plt.figure(figsize = (10,8))
pass_plot_3 = sns.lineplot(data = passing_player_data_midfielders.query("season_name != '2022-2023'"), x = 'season', y = 'progressive_passes', ci = None, estimator = np.mean, linewidth = 5)
plt.title('Average Progressive Passes Made by Midfielders per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Progressive Passes by Midfielders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
pass_plot_3.set_xticks([2017, 2018, 2019, 2020, 2021])
pass_plot_3.set(ylim = (70, 92.5))
plt.show()
The average number of progressive passes made by midfielders has decreased as the seasons have progressed.
Check the progressive passes for midfielders, but separated by Table position:
First, the information about table position needs to be merged into the player passing data.
passing_player_data_midfielders = pd.merge(passing_player_data_midfielders, standard_stats_data.loc[:, ['squad', 'season_name', 'ranking', 'table_position_category']], on = ['squad', 'season_name'])
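`pd.merge` defaults to an inner join, so players whose squad and season have no match in the standings data are dropped silently. One hedged way to check for that is a left join with `validate` and `indicator`; the frames below are toy stand-ins, not the real data.

```python
import pandas as pd

# Toy frames for illustration only
toy_players = pd.DataFrame({
    'player': ['A', 'B', 'C'],
    'squad': ['Arsenal', 'Arsenal', 'Unknown FC'],
    'season_name': ['2017-2018', '2017-2018', '2017-2018'],
})
toy_standings = pd.DataFrame({
    'squad': ['Arsenal'],
    'season_name': ['2017-2018'],
    'table_position_category': ['Top 6'],
})

# validate raises if a squad-season pair appears twice in the standings;
# indicator adds a _merge column showing where each row came from
checked = pd.merge(toy_players, toy_standings,
                   on=['squad', 'season_name'],
                   how='left',
                   validate='many_to_one',
                   indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['player', 'squad']])
```

An empty `unmatched` frame would indicate that the inner join loses no players.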
plt.figure(figsize = (10,8))
pass_plot_4 = sns.lineplot(data = passing_player_data_midfielders.query("season_name != '2022-2023'"), x = 'season', y = 'progressive_passes',hue = 'table_position_category' ,
ci = None, estimator = np.mean, linewidth = 5, palette = palette_table_position)
plt.title('Average Progressive Passes by Midfielders by Table Position', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Progressive Passes by Midfielders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
pass_plot_4.set_xticks([2017, 2018, 2019, 2020, 2021])
#pass_plot_4.set(ylim = (66, 84))
plt.legend(title = 'Table Position:')
plt.show()
Now plot progressive passes by high possession teams:
plt.figure(figsize = (10,8))
pass_plot_5 = sns.lineplot(data = passing_player_data_midfielders.query("season_name != '2022-2023'"), x = 'season', y = 'progressive_passes',hue = 'high_possession_team' ,
ci = None, estimator = np.mean, linewidth = 5, palette = palette_possession)
plt.title('Average Progressive Passes by Midfielders by Possession Category', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Progressive Passes by Midfielders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
pass_plot_5.set_xticks([2017, 2018, 2019, 2020, 2021])
plt.legend(title = 'Possession Category:')
plt.show()
The above three graphs show that midfielders have generally made fewer progressive passes on average as the seasons have progressed. Interestingly, even midfielders from high possession teams have seen a decrease in the average number of progressive passes they make. The trend also holds for midfielders in the Top 6 teams, which possess some of the best passers in world football!
iii. Assess the tackles being made by midfielders:
First, make a data frame for midfielders tackling performance and defensive stats:
defending_player_data_midfielders = defending_player_data[defending_player_data['position'].str.contains("MF")]
defending_player_data_midfielders = defending_player_data_midfielders.rename({'90s_played': 'nineties_played'}, axis = 1)
defending_player_data_midfielders = defending_player_data_midfielders.query("nineties_played > 10")
defending_player_data_midfielders = defending_player_data_midfielders.reset_index()
Plot tackles plus interceptions for midfielders:
plt.figure(figsize = (10,8))
pass_plot_6 = sns.lineplot(data = defending_player_data_midfielders.query("season_name != '2022-2023'"), x = 'season', y = 'tackles_plus_interceptions', ci = None, estimator = np.mean, linewidth = 5)
pass_plot_6.set_xticks([2017, 2018, 2019, 2020, 2021])
plt.title('Average Tackles plus Interceptions by Midfielders per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Tackles plus Interceptions by Midfielders", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.show()
The average number of tackles plus interceptions made by midfielders has decreased each season, starting from 2018.
Assessing the questions about the evolving role of midfielders:
First, making a data frame for defensive metrics data for forwards: the position column is filtered using 'FW', which indicates a player who plays as a forward. Only players who played more than 10 full games in a season are kept.
defending_player_data_forwards = defending_player_data[defending_player_data['position'].str.contains("FW")]
defending_player_data_forwards = defending_player_data_forwards.rename({'90s_played': 'nineties_played'}, axis = 1)
defending_player_data_forwards = defending_player_data_forwards.query("nineties_played > 10")
defending_player_data_forwards = defending_player_data_forwards.reset_index()
Now plot tackles in final 3rd for just the forwards:
plt.figure(figsize = (10,8))
pressing_plot_1 = sns.lineplot(data = defending_player_data_forwards.query(" season_name != '2022-2023' "), x = 'season', y = 'tackles_in_attacking_3rd' ,ci = None, estimator = np.mean, linewidth = 5)
pressing_plot_1.set_xticks([2017, 2018, 2019, 2020, 2021])
plt.title('Mean Tackles in Final 3rd by Forwards per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Tackles in Final 3rd by Forwards", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
#pressing_plot_1.set(ylim = (300, 440))
plt.show()
The graph shows that the average number of tackles made by forwards in the final third has not changed drastically, despite the increase in the 2021 season.
This can be attributed to the fact that forwards are not naturally good tacklers. Therefore, even high-pressing, which relies on forwards being diligent in their pressing, is designed by managers in a way that the whole team together suffocates the opposition into conceding the ball. This includes pressing traps, where the opponent is not pressured in certain areas, but once the ball moves towards another area, they are pressed. The touchline is a common pressing trap because space is restricted near the touchline. In such a cleverly designed pressing system, forwards are tasked more with cutting off passes and closing gaps, and are not always tasked with making a tackle to win the ball high.
First, the data on how opponent goalkeepers perform and pass against each team needs to be extracted from FBref:
# Get the goalkeeping stats from the opponents ("against") table
def get_vs_gk_stats(urls):
    advanced_goalkeeping_cols = ['squad', 'players_used', '90s_played', 'goals_conceded_against', 'penalties_conceded_against',
                                 'free_kicks_conceded_against', 'corners_conceded_against', 'own_goals_conceded_against', 'post_shot_xg_against',
                                 'post_shot_xg_per_shot_on_target_against', 'post_shot_xg_minus_goals_conceded_against',
                                 'post_shot_xg_minus_goals_conceded_per90_against',
                                 'launched_passes_completed_against', 'launched_passes_attempted_against', 'launched_passes_completed_percent_against',
                                 'passes_attempted_against', 'throws_attempted_against', 'passes_launched_percent_against', 'pass_avg_length_against',
                                 'goal_kicks_attempted_against', 'goal_kicks_launched_percent_against', 'goal_kicks_avg_length_against',
                                 'crosses_attempted_against_against', 'crosses_stopped_by_against', 'crosses_stop_percent_against',
                                 'defensive_actions_outside_area_against', 'defensive_actions_outside_area_per90_against',
                                 'avg_dist_of_defensive_actions_against']
    adv_goalkeeping_against_data = pd.DataFrame()
    try:
        # Iterate over the urls argument rather than a global dictionary
        for season_name, season_url in urls.items():
            # Connectivity check; raises requests.exceptions.ConnectionError if the site is unreachable
            page = requests.get(season_url)
            # The second table matching the title holds the opponent ("vs") figures
            adv_goalkeeping_against = pd.read_html(season_url, match = "Squad Advanced Goalkeeping")[1]
            adv_goalkeeping_against = pd.DataFrame(adv_goalkeeping_against)
            # Drop the top level of the two-level header, then apply the clean column names
            adv_goalkeeping_against.columns = adv_goalkeeping_against.columns.droplevel()
            adv_goalkeeping_against.columns = advanced_goalkeeping_cols
            adv_goalkeeping_against['season_name'] = season_name
            adv_goalkeeping_against['name'] = 'advanced_goalkeeping_against'
            adv_goalkeeping_against_data = pd.concat([adv_goalkeeping_against_data, adv_goalkeeping_against], ignore_index = True)
            # Pause between requests to avoid overloading the site
            time.sleep(20)
    except HTTPError as error:
        print(error)
        print("The code could not be executed due to a HTTP error!")
    except ValueError:
        print("The table could not be found in the URL provided! Check the table Title")
    except requests.exceptions.ConnectionError as e:
        print("Connection could not be established! Check the URL!")
    except IndexError:
        print("There was a list index out of range error!")
    return adv_goalkeeping_against_data
urls_gk_opponent = {'2022-2023': 'https://fbref.com/en/comps/9/Premier-League-Stats',
'2021-2022': 'https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats',
'2020-2021': 'https://fbref.com/en/comps/9/2020-2021/2020-2021-Premier-League-Stats',
'2019-2020': 'https://fbref.com/en/comps/9/2019-2020/2019-2020-Premier-League-Stats',
'2018-2019': 'https://fbref.com/en/comps/9/2018-2019/2018-2019-Premier-League-Stats',
'2017-2018': 'https://fbref.com/en/comps/9/2017-2018/2017-2018-Premier-League-Stats'}
adv_goalkeeping_against_data = get_vs_gk_stats(urls_gk_opponent)
Unit testing the above function and data frame:
class TestScraping(unittest.TestCase):
    # Test whether the function exists:
    def test_fun_exists(self):
        self.assertIsNotNone(get_vs_gk_stats)
        print("The function exists!")

    # Test whether the data frame is not None, so something has been returned
    def test_data_frame_exists(self):
        self.assertIsNotNone(adv_goalkeeping_against_data)
        print("The data frame exists!")

    # Test whether the data frame has at least 100 rows:
    def test_data_frame_rows_length(self):
        self.assertGreaterEqual(len(adv_goalkeeping_against_data), 100)
        print("The data frame has some data, the scraping is successful!")

    # Test whether the data frame has at least 10 columns:
    def test_data_frame_columns_length(self):
        self.assertGreaterEqual(len(adv_goalkeeping_against_data.columns), 10)
        print("The data frame has some columns, the scraping is successful!")

unittest.main(argv = ['ignored', '-v'], exit = False)
test_data_frame_columns_length (__main__.TestScraping) ... ok test_data_frame_exists (__main__.TestScraping) ... ok test_data_frame_rows_length (__main__.TestScraping) ... ok test_fun_exists (__main__.TestScraping) ...
The data frame has some columns, the scraping is successful! The data frame exists! The data frame has some data, the scraping is successful! The function exists!
ok ---------------------------------------------------------------------- Ran 4 tests in 0.004s OK
<unittest.main.TestProgram at 0x7ff37a48ab80>
All the unit tests passed and the scraping was successful.
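The checks above confirm that something was returned. A slightly stricter variant, sketched here on a toy frame, could also verify that every season is complete and that no squad appears twice in a season. `toy_scraped`, `EXPECTED_SQUADS_PER_SEASON` and `TestScrapedShape` are hypothetical names introduced for illustration; for the Premier League data the expected count would be 20 squads per season.

```python
import unittest

import pandas as pd

# Toy stand-in for a scraped frame: 2 seasons x 3 squads (made-up rows)
toy_scraped = pd.DataFrame({
    'squad': ['vs A', 'vs B', 'vs C'] * 2,
    'season_name': ['2017-2018'] * 3 + ['2018-2019'] * 3,
})
EXPECTED_SQUADS_PER_SEASON = 3  # would be 20 for the Premier League


class TestScrapedShape(unittest.TestCase):
    # Every season should contain the expected number of distinct squads
    def test_every_season_is_complete(self):
        counts = toy_scraped.groupby('season_name')['squad'].nunique()
        self.assertTrue((counts == EXPECTED_SQUADS_PER_SEASON).all())

    # No squad should appear twice within the same season
    def test_no_duplicate_squad_season_pairs(self):
        duplicated = toy_scraped.duplicated(subset=['squad', 'season_name'])
        self.assertFalse(duplicated.any())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestScrapedShape)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```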
Assess data types and missing values:
assess_data_frames(adv_goalkeeping_against_data)
There are 120 rows and 30 columns
There are no columns with missing values
squad    object
players_used    int64
90s_played    float64
goals_conceded_against    int64
penalties_conceded_against    int64
free_kicks_conceded_against    int64
corners_conceded_against    int64
own_goals_conceded_against    int64
post_shot_xg_against    float64
post_shot_xg_per_shot_on_target_against    float64
post_shot_xg_minus_goals_conceded_against    float64
post_shot_xg_minus_goals_conceded_per90_against    float64
launched_passes_completed_against    int64
launched_passes_attempted_against    int64
launched_passes_completed_percent_against    float64
passes_attempted_against    int64
throws_attempted_against    int64
passes_launched_percent_against    float64
pass_avg_length_against    float64
goal_kicks_attempted_against    int64
goal_kicks_launched_percent_against    float64
goal_kicks_avg_length_against    float64
crosses_attempted_against_against    int64
crosses_stopped_by_against    int64
crosses_stop_percent_against    float64
defensive_actions_outside_area_against    int64
defensive_actions_outside_area_per90_against    float64
avg_dist_of_defensive_actions_against    float64
season_name    object
name    object
dtype: object
There are no missing values and the data types of the columns are all appropriate.
Assess outliers:
adv_goalkeeping_against_data.apply(find_outliers_IQR, axis = 'rows')
squad    None
players_used    [[]]
90s_played    [[]]
goals_conceded_against    [[]]
penalties_conceded_against    [[]]
free_kicks_conceded_against    [[4, 4, 4, 5]]
corners_conceded_against    [[15, 15]]
own_goals_conceded_against    [[6]]
post_shot_xg_against    [[]]
post_shot_xg_per_shot_on_target_against    [[0.17]]
post_shot_xg_minus_goals_conceded_against    [[]]
post_shot_xg_minus_goals_conceded_per90_against    [[-0.67]]
launched_passes_completed_against    [[]]
launched_passes_attempted_against    [[]]
launched_passes_completed_percent_against    [[]]
passes_attempted_against    [[]]
throws_attempted_against    [[]]
passes_launched_percent_against    [[]]
pass_avg_length_against    [[]]
goal_kicks_attempted_against    [[]]
goal_kicks_launched_percent_against    [[]]
goal_kicks_avg_length_against    [[]]
crosses_attempted_against_against    [[]]
crosses_stopped_by_against    [[]]
crosses_stop_percent_against    [[10.9]]
defensive_actions_outside_area_against    [[74]]
defensive_actions_outside_area_per90_against    [[2.21, 2.13]]
avg_dist_of_defensive_actions_against    [[19.9]]
season_name    None
name    None
dtype: object
Judging by domain knowledge, none of the flagged values are the result of errors in the data.
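For reference, the IQR rule flags values lying more than 1.5 interquartile ranges below the first quartile or above the third quartile. The function below is a hypothetical stand-in written for illustration; the `find_outliers_IQR` helper used above may differ in detail, and the toy column is made up.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Hypothetical stand-in; find_outliers_IQR may differ in detail.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Toy column for illustration only
toy_own_goals = pd.Series([0, 1, 1, 2, 2, 3, 9])
print(iqr_outliers(toy_own_goals).tolist())
```

A flagged value is only a candidate for inspection; whether it is an error still has to be judged with domain knowledge, as done above.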
Write the table to CSV file:
adv_goalkeeping_against_data.to_csv('adv_goalkeeping_against_data.csv')
Performing some cleaning on the data including:
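# Remove only the leading "vs " prefix; str.strip('vs ') would also trim a trailing 's' or 'v' from squad names such as Wolves
adv_goalkeeping_against_data['squad'] = adv_goalkeeping_against_data['squad'].str.replace(r'^vs ', '', regex = True)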
adv_goalkeeping_against_data = pd.merge(adv_goalkeeping_against_data, goalkeeping_merged_data.loc[:, ['season_name', 'squad', 'ranking', 'table_position_category', 'possession', 'possession_75th_percentile', 'possession_50th_percentile', 'high_possession_team', 'high_press_team', 'pct_final_3rd_tackles']], on = ['squad', 'season_name'])
adv_goalkeeping_against_data['season'] = adv_goalkeeping_against_data['season_name'].str.split('-', expand = True)[0]
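When removing the "vs " prefix from squad names, it helps to remember that `str.strip` treats its argument as a set of characters, not as a literal prefix, so it can also remove trailing letters. The sketch below contrasts it with an anchored regular expression on made-up squad labels.

```python
import pandas as pd

# Made-up squad labels in the "vs <team>" form used by the opponent tables
toy_squads = pd.Series(['vs Arsenal', 'vs Wolves', 'vs Spurs'])

# str.strip removes any of the characters 'v', 's', ' ' from both ends
stripped = toy_squads.str.strip('vs ')

# An anchored regex removes only the literal leading "vs "
prefix_removed = toy_squads.str.replace(r'^vs ', '', regex=True)

print(pd.DataFrame({'strip': stripped, 'prefix_removed': prefix_removed}))
```

A truncated name would fail to match its counterpart in a later merge on `squad`, and an inner join would then drop that team without warning.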
Make a data frame for data against ONLY HIGH PRESS TEAMS:
vs_high_press = adv_goalkeeping_against_data.query("high_press_team == 'High Pressing Team' ")
vs_high_press = vs_high_press.sort_values(by = 'season')
Now finding an answer to the question: are high pressing teams forcing opponent goalkeepers to resort to long balls?
plt.figure(figsize = (10,8))
pressing_plot_2 = sns.lineplot(data = vs_high_press.query("season_name != '2022-2023'"), x = 'season', y = 'goal_kicks_avg_length_against', ci = None, estimator = np.mean, linewidth = 5)
plt.title('Average Goal kick length against High Press teams per Season', fontsize = 20, loc = 'left',fontweight = 'bold',pad = 15)
plt.ylabel("Average Length of Goal Kicks", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.show()
The answer is that high-pressing teams have not forced the opponent goalkeepers to launch long balls from goal kicks. This can be attributed to two related factors:
In the survival of the fittest world of football, not adapting means bad results. Therefore, even though teams are becoming better at pressing, their opponents are not resorting to long balls in panic.
Now plot the goal kick launch % and the number of launched passes against high press teams:
f, ax = plt.subplots(1,2, figsize = (18,10))
pressing_plot_3 = sns.lineplot(data = vs_high_press, x = 'season', y = 'goal_kicks_launched_percent_against', ci = None, estimator = np.mean, linewidth = 5, ax = ax[0])
pressing_plot_4 = sns.lineplot(data = vs_high_press, x = 'season', y = 'launched_passes_attempted_against', ci = None, estimator = np.mean, linewidth = 5, ax = ax[1])
pressing_plot_3.set_title('% of Goal Kicks Launched against high press teams per Season:',fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
pressing_plot_3.set_xlabel('Season', fontsize = 15, labelpad = 15)
pressing_plot_3.set_ylabel('% of Goal Kicks Launched Long', fontsize = 15, labelpad = 10)
pressing_plot_4.set_title('Passes Launched against high press teams per Season:',fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
pressing_plot_4.set_xlabel('Season', fontsize = 15, labelpad = 15)
pressing_plot_4.set_ylabel('Passes Launched Long', fontsize = 15, labelpad = 10)
plt.show()
The above subplot shows that:
Now finally, check pass length by the goalkeeper against high press teams:
plt.figure(figsize = (10,8))
pressing_plot_5 = sns.lineplot(data = vs_high_press, x = 'season', y = 'pass_avg_length_against', ci = None,
estimator = np.mean, linewidth = 5)
plt.title('Goalkeeper Pass Length against High Press teams per Season', fontsize = 20, loc = 'center',fontweight = 'bold',pad = 15)
plt.ylabel("Goalkeeper Pass Length", fontsize = 15, labelpad = 15)
plt.xlabel("Season", fontsize = 15, labelpad = 15)
plt.show()
The average pass length by goalkeepers against high pressing teams has also decreased.
Data from FBref on how opponents pass against each team in the outfield needs to be scraped:
# Get passing stats for opponents:
def get_vs_opponent_passing(urls):
    passing_opponent_cols = ['squad', 'players_used', '90s_played','passes_completed_against', 'passes_attempted_against',
                             'passes_completed_percent_against', 'passes_distance_travelled_against', 'passes_progressive_distance_travelled_against',
                             'short_passes_completed_against', 'short_passes_attempted_against','short_passes_completed_percent_against',
                             'medium_passes_completed_against', 'medium_passes_attempted_against','medium_passes_completed_percent_against',
                             'long_passes_completed_against', 'long_passes_attempted_against','long_passes_completed_percent_against',
                             'assists_against', 'expected_assisted_goals_against', 'xa_against', 'assists_minus_expected_assisted_goals_against',
                             'key_passes_against', 'final_third_entering_passes_against', 'passes_into_18_yard_against',
                             'crosses_into_18_yard_against', 'progressive_passes_against']
    passing_opponent_data = pd.DataFrame()
    try:
        # Iterate over the urls argument rather than a global dictionary
        for season_name, season_url in urls.items():
            # Connectivity check; raises requests.exceptions.ConnectionError if the site is unreachable
            page = requests.get(season_url)
            # The second table matching the title holds the opponent ("vs") figures
            passing_opponent = pd.read_html(season_url, match = "Squad Passing")[1]
            passing_opponent = pd.DataFrame(passing_opponent)
            # Drop the top level of the two-level header, then apply the clean column names
            passing_opponent.columns = passing_opponent.columns.droplevel()
            passing_opponent.columns = passing_opponent_cols
            passing_opponent['season_name'] = season_name
            passing_opponent['name'] = 'passing_against'
            passing_opponent_data = pd.concat([passing_opponent_data, passing_opponent], ignore_index = True)
            # Pause between requests to avoid overloading the site
            time.sleep(20)
    except HTTPError as error:
        print(error)
        print("The code could not be executed")
    except ValueError:
        print("The table could not be found in the URL provided! Check the table Title")
    except requests.exceptions.ConnectionError as e:
        print("Connection could not be established! Check the URL!")
    except IndexError:
        print("There was a list index out of range error!")
    return passing_opponent_data
urls_opponent_passing = {'2022-2023': 'https://fbref.com/en/comps/9/Premier-League-Stats',
'2021-2022': 'https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats',
'2020-2021': 'https://fbref.com/en/comps/9/2020-2021/2020-2021-Premier-League-Stats',
'2019-2020': 'https://fbref.com/en/comps/9/2019-2020/2019-2020-Premier-League-Stats',
'2018-2019': 'https://fbref.com/en/comps/9/2018-2019/2018-2019-Premier-League-Stats',
'2017-2018': 'https://fbref.com/en/comps/9/2017-2018/2017-2018-Premier-League-Stats'}
passing_opponent_data = get_vs_opponent_passing(urls_opponent_passing)
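Both scraping functions wrap the whole loop in a single try block, so a failure in one season ends the loop and the remaining seasons are never fetched. A hedged alternative is to handle errors per season and record which seasons failed. In this sketch `collect_seasons`, `fetch_table` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real fetcher so that the example runs without network access.

```python
import pandas as pd


def collect_seasons(urls, fetch_table):
    """Fetch one table per season; record failures instead of stopping.

    fetch_table is a hypothetical callable taking a URL and returning a
    DataFrame (for example a thin wrapper around pd.read_html).
    """
    frames, failed = [], {}
    for season_name, season_url in urls.items():
        try:
            table = fetch_table(season_url)
        except (ValueError, IndexError) as error:
            # Remember the failure and carry on with the next season
            failed[season_name] = str(error)
            continue
        frames.append(table.assign(season_name=season_name))
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, failed


# Stand-in fetcher so the sketch runs without network access
def fake_fetch(url):
    if 'missing' in url:
        raise ValueError('No tables found')
    return pd.DataFrame({'squad': ['vs Arsenal', 'vs Chelsea']})


toy_urls = {'2017-2018': 'https://example.com/ok',
            '2018-2019': 'https://example.com/missing'}
data, failed = collect_seasons(toy_urls, fake_fetch)
print(len(data), failed)
```

Returning the failures alongside the data makes it possible to retry only the seasons that did not load.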
Running some unit tests:
class TestScraping(unittest.TestCase):
    # Test whether the function exists:
    def test_fun_exists(self):
        self.assertIsNotNone(get_vs_opponent_passing)
        print("The function exists!")

    # Test whether the data frame is not None, so something has been returned
    def test_data_frame_exists(self):
        self.assertIsNotNone(passing_opponent_data)
        print("The data frame exists!")

    # Test whether the data frame has at least 100 rows:
    def test_data_frame_rows_length(self):
        self.assertGreaterEqual(len(passing_opponent_data), 100)
        print("The data frame has some data, the scraping is successful!")

    # Test whether the data frame has at least 10 columns:
    def test_data_frame_columns_length(self):
        self.assertGreaterEqual(len(passing_opponent_data.columns), 10)
        print("The data frame has some columns, the scraping is successful!")

unittest.main(argv = ['ignored', '-v'], exit = False)
test_data_frame_columns_length (__main__.TestScraping) ... ok test_data_frame_exists (__main__.TestScraping) ... ok test_data_frame_rows_length (__main__.TestScraping) ... ok test_fun_exists (__main__.TestScraping) ...
The data frame has some columns, the scraping is successful! The data frame exists! The data frame has some data, the scraping is successful! The function exists!
ok ---------------------------------------------------------------------- Ran 4 tests in 0.003s OK
<unittest.main.TestProgram at 0x7ff37a44b760>
All the unit tests have passed and the scraping was successful.
Assess data types and missing values:
assess_data_frames(passing_opponent_data)
There are 120 rows and 28 columns
There are no columns with missing values
squad    object
players_used    int64
90s_played    float64
passes_completed_against    int64
passes_attempted_against    int64
passes_completed_percent_against    float64
passes_distance_travelled_against    int64
passes_progressive_distance_travelled_against    int64
short_passes_completed_against    int64
short_passes_attempted_against    int64
short_passes_completed_percent_against    float64
medium_passes_completed_against    int64
medium_passes_attempted_against    int64
medium_passes_completed_percent_against    float64
long_passes_completed_against    int64
long_passes_attempted_against    int64
long_passes_completed_percent_against    float64
assists_against    int64
expected_assisted_goals_against    float64
xa_against    float64
assists_minus_expected_assisted_goals_against    float64
key_passes_against    int64
final_third_entering_passes_against    int64
passes_into_18_yard_against    int64
crosses_into_18_yard_against    int64
progressive_passes_against    int64
season_name    object
name    object
dtype: object
There are no missing values. All the columns have appropriate data types.
Assess outliers:
passing_opponent_data.apply(find_outliers_IQR, axis = 'rows')
squad    None
players_used    [[]]
90s_played    [[]]
passes_completed_against    [[]]
passes_attempted_against    [[]]
passes_completed_percent_against    [[68.9]]
passes_distance_travelled_against    [[]]
passes_progressive_distance_travelled_against    [[]]
short_passes_completed_against    [[]]
short_passes_attempted_against    [[]]
short_passes_completed_percent_against    [[]]
medium_passes_completed_against    [[]]
medium_passes_attempted_against    [[]]
medium_passes_completed_percent_against    [[76.8]]
long_passes_completed_against    [[]]
long_passes_attempted_against    [[]]
long_passes_completed_percent_against    [[]]
assists_against    [[]]
expected_assisted_goals_against    [[]]
xa_against    [[]]
assists_minus_expected_assisted_goals_against    [[-20.8]]
key_passes_against    [[]]
final_third_entering_passes_against    [[]]
passes_into_18_yard_against    [[]]
crosses_into_18_yard_against    [[]]
progressive_passes_against    [[]]
season_name    None
name    None
dtype: object
From domain knowledge, none of the returned outliers are errors.
Write file to CSV:
passing_opponent_data.to_csv('passing_opponent_data.csv')
Performing some cleaning on the data including:
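# Remove only the leading "vs " prefix; str.strip('vs ') would also trim a trailing 's' or 'v' from squad names such as Wolves
passing_opponent_data['squad'] = passing_opponent_data['squad'].str.replace(r'^vs ', '', regex = True)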
passing_opponent_data = pd.merge(passing_opponent_data, goalkeeping_merged_data.loc[:, ['season_name', 'squad', 'ranking', 'table_position_category', 'possession', 'possession_75th_percentile', 'possession_50th_percentile', 'high_possession_team', 'high_press_team', 'pct_final_3rd_tackles']], on = ['squad', 'season_name'])
passing_opponent_data['season'] = passing_opponent_data['season_name'].str.split('-', expand = True)[0]
Filter to get data only for how teams are passing against high press teams:
vs_high_press_passing = passing_opponent_data.query("high_press_team == 'High Pressing Team' ")
vs_high_press_passing = vs_high_press_passing.sort_values(by = 'season')
vs_high_press_passing.head(3)
| | squad | players_used | 90s_played | passes_completed_against | passes_attempted_against | passes_completed_percent_against | passes_distance_travelled_against | passes_progressive_distance_travelled_against | short_passes_completed_against | short_passes_attempted_against | ... | name | ranking | table_position_category | possession | possession_75th_percentile | possession_50th_percentile | high_possession_team | high_press_team | pct_final_3rd_tackles | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111 | Tottenham | 25 | 38.0 | 10279 | 14649 | 70.2 | 194241 | 82725 | 4476 | 5395 | ... | passing_against | 3 | Top 6 | 61.8 | 54.925 | 46.85 | High Possession Team | High Pressing Team | 14.077670 | 2017 |
| 105 | Manchester City | 25 | 38.0 | 8395 | 12187 | 68.9 | 157327 | 68682 | 3748 | 4498 | ... | passing_against | 1 | Top 6 | 71.0 | 54.925 | 46.85 | High Possession Team | High Pressing Team | 16.163410 | 2017 |
| 104 | Liverpool | 27 | 38.0 | 11830 | 16343 | 72.4 | 215503 | 87782 | 5348 | 6298 | ... | passing_against | 4 | Top 6 | 60.3 | 54.925 | 46.85 | High Possession Team | High Pressing Team | 13.657771 | 2017 |
3 rows × 37 columns
i. Check how the long and short passes attempted per season against high press teams are trending:
f, ax = plt.subplots(1,2, figsize = (18,10))
pressing_plot_6 = sns.lineplot(data = vs_high_press_passing.query("season_name != '2022-2023'"), x = 'season', y = 'long_passes_attempted_against', ci = None, estimator = np.mean, linewidth = 5, ax = ax[0])
pressing_plot_7 = sns.lineplot(data = vs_high_press_passing.query("season_name != '2022-2023'"), x = 'season', y = 'short_passes_attempted_against', ci = None, estimator = np.mean, linewidth = 5, ax = ax[1])
pressing_plot_6.set_title('Long Passes Attempted per Season against High Press Teams:',fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
pressing_plot_6.set_xlabel('Season', fontsize = 15, labelpad = 15)
pressing_plot_6.set_ylabel('Long Passes Attempted', fontsize = 15, labelpad = 10)
pressing_plot_7.set_title('Short Passes Attempted per Season against High Press Teams',fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
pressing_plot_7.set_xlabel('Season', fontsize = 15, labelpad = 15)
pressing_plot_7.set_ylabel('Short Passes Attempted', fontsize = 15, labelpad = 10)
plt.show()
In the same vein as the previous question's finding, teams seem to have become braver against the high press as seasons have progressed. The number of short passes against the high press has increased significantly, while the number of long passes against the high press has decreased significantly.
This will be done through cluster analysis using the K-means clustering method.
Brief description of the K-means clustering method: K-means clustering is one of the simplest clustering algorithms. The algorithm attempts to partition the data into K groups, where K is defined by the user. Each data point belongs to only one of the K groups. The clustering is done to maximise the similarity of data points within each cluster. Clustering is an unsupervised method of machine learning.
Watch this video about K-means by the brilliant Josh Starmer to get more details about K-means clustering [13].
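The partitioning idea can be sketched in a few lines of NumPy. This is a bare-bones version of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points), written only to illustrate the description above; it is not the scikit-learn implementation used later, and `kmeans_sketch` and the toy points are hypothetical.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: assign points, then move centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # New centroid = mean of its assigned points (keep the old one if empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy blobs (synthetic, for illustration only)
rng = np.random.default_rng(1)
toy_points = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                        rng.normal(5, 0.5, size=(20, 2))])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_centroids.round(1))
```

Because the starting centroids are random, production implementations usually repeat the procedure from several starting points and keep the best run.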
The idea is to utilise clustering as an automated means of identifying and grouping the teams based on their pressing category.
Deciding the data and by extension, the factors that will be utilised to cluster teams together is important. Here is the data that will be used for the clustering process and the reasoning behind their usage:
Deciding the exact columns to be used from each of the data frames mentioned above and storing them in lists:
def_clustering_columns = ['ranking','season', 'tackles_made', 'tackles_won','tackles_in_defensive_3rd',
'tackles_in_middle_3rd', 'tackles_in_attacking_3rd', 'tackles_plus_interceptions',
'pct_final_3rd_tackles', 'xg_against', 'non_penalty_xg_against', 'sca']
adv_goalkeeping_columns = ['ranking', 'season', 'post_shot_xg_against', 'post_shot_xg_per_shot_on_target_against',
'launched_passes_attempted_against', 'passes_launched_percent_against',
'pass_avg_length_against', 'goal_kicks_launched_percent_against',
'goal_kicks_avg_length_against']
passing_opponent_columns = ['ranking','season', 'passes_completed_percent_against',
'passes_distance_travelled_against', 'short_passes_completed_percent_against',
'medium_passes_completed_percent_against', 'long_passes_completed_percent_against',
]
adv_goalkeeping_against_data['season'] = adv_goalkeeping_against_data['season'].astype('int')
passing_opponent_data['season'] = passing_opponent_data['season'].astype('int')
Merging the three data frames mentioned above to create a master data frame for clustering:
First, merge defensive actions data and advanced goalkeeping data. Then merge the passing_opponent data. The lists created above for specific columns come in handy.
pressing_cluster_data = pd.merge(defensive_actions_data[def_clustering_columns], adv_goalkeeping_against_data[adv_goalkeeping_columns],
on = ['season', 'ranking'])
pressing_cluster_data = pd.merge(pressing_cluster_data, passing_opponent_data[passing_opponent_columns],
on = ['season', 'ranking'])
Now, the data needs to be scaled. The K-means clustering algorithm groups data points based on their distances from each other: points that are close together are placed in the same cluster. When different columns in the data have widely different ranges, the columns with the largest ranges dominate the distance calculation and the algorithm might give subpar results. Therefore, it is standard practice to scale all the columns to a comparable range. The StandardScaler from the scikit-learn library will be used for this purpose. It scales each column by subtracting its mean and dividing by its standard deviation [14].
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
Scaling all columns, except the season and ranking columns, using the StandardScaler:
# Columns to scale: every clustering feature except the 'season' and 'ranking' identifiers
columns_to_scale = ['tackles_made', 'tackles_won',
'tackles_in_defensive_3rd', 'tackles_in_middle_3rd',
'tackles_in_attacking_3rd', 'tackles_plus_interceptions',
'pct_final_3rd_tackles', 'xg_against', 'non_penalty_xg_against', 'sca',
'post_shot_xg_against', 'post_shot_xg_per_shot_on_target_against',
'launched_passes_attempted_against', 'passes_launched_percent_against',
'pass_avg_length_against', 'goal_kicks_launched_percent_against',
'goal_kicks_avg_length_against', 'passes_completed_percent_against',
'passes_distance_travelled_against',
'short_passes_completed_percent_against',
'medium_passes_completed_percent_against',
'long_passes_completed_percent_against']
# Fit the scaler on these columns and overwrite them with their standardised values
pressing_cluster_data[columns_to_scale] = scaler.fit_transform(pressing_cluster_data[columns_to_scale])
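To illustrate what the scaling step does, here is a minimal sketch on a hypothetical two-column array (the numbers are invented). It checks that StandardScaler matches the manual calculation of subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales (e.g. a count and a percentage).
X = np.array([[500.0, 12.0], [620.0, 18.0], [480.0, 15.0], [560.0, 21.0]])

scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so no single column dominates the distances that K-means computes.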
Applying the K-means algorithm on the data:
K is set to three so that teams fall into three clusters based on their pressing characteristics: a high/effective pressing cluster, a low/ineffective pressing cluster, and a cluster for the teams in between.
km = KMeans(n_clusters = 3, init = 'random', n_init = 100, random_state = 730)
pressing_cluster_results = km.fit_predict(pressing_cluster_data)
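Because K-means is distance based, every column passed to `fit_predict` contributes to the result. One way to guarantee that identifier columns stay out of the distance computation is to select the feature columns explicitly before fitting. The following is a minimal sketch on a hypothetical toy frame (all values invented, with two clusters for brevity), not the report's actual pipeline.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; 'season' and 'ranking' are identifiers, not features.
toy = pd.DataFrame({
    "season": [2021, 2021, 2021, 2022, 2022, 2022],
    "ranking": [1, 10, 20, 1, 10, 20],
    "pct_final_3rd_tackles": [20.0, 12.0, 8.0, 19.0, 11.0, 9.0],
    "passes_completed_percent_against": [70.0, 78.0, 82.0, 71.0, 79.0, 83.0],
})

id_columns = ["season", "ranking"]
feature_columns = [c for c in toy.columns if c not in id_columns]

# Scale and cluster on the feature columns only.
features = StandardScaler().fit_transform(toy[feature_columns])
toy["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print(toy[id_columns + ["cluster"]])
```

The identifier columns remain in the frame for later merges, but they never reach the clustering step.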
Assess clustering results using the silhouette score:
from sklearn.metrics import silhouette_samples, silhouette_score
silhouette_avg = silhouette_score(pressing_cluster_data, pressing_cluster_results)
silhouette_avg
0.3699432698524288
The silhouette score is an interpretable measure of how well each point fits its assigned cluster compared with the nearest neighbouring cluster. It ranges from negative one (-1) to positive one (+1). Negative values suggest that points may have been assigned to the wrong cluster, values near zero indicate overlapping clusters, and values close to positive one indicate that points are well matched to their own cluster and clearly separated from the others.
The silhouette score of about 0.37 indicates scope for improvement in the clustering, because the clusters are only moderately separated from one another.
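The silhouette score can also be used to compare different values of K. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_blobs` rather than the report's data, and the range of K values is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (115 points, 3 true groups); not the report's data.
X, _ = make_blobs(n_samples=115, centers=3, cluster_std=1.0, random_state=0)

# Average silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```

A higher average score suggests better separated clusters, although the choice of K should still make sense for the question being asked.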
Using a silhouette plot to visualise the clusters:
from yellowbrick.cluster import SilhouetteVisualizer
plt.figure(figsize = (10,8))
sil_visualizer = SilhouetteVisualizer(km)
sil_visualizer.fit(pressing_cluster_data)
sil_visualizer.show()
[Figure: Silhouette Plot of KMeans Clustering for 115 Samples in 3 Centers. x-axis: silhouette coefficient values; y-axis: cluster label.]
A few data points in cluster 1 have a negative silhouette coefficient, which suggests that they may sit closer to a neighbouring cluster than to their own.
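Points with a negative coefficient can be counted directly with `silhouette_samples`, which returns one value per data point. This is a minimal sketch on synthetic, deliberately overlapping data from `make_blobs` (an assumption for illustration, not the report's data).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic, overlapping groups so that some points may be poorly matched.
X, _ = make_blobs(n_samples=115, centers=3, cluster_std=3.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One silhouette coefficient per point; their mean is the overall silhouette score.
per_point = silhouette_samples(X, labels)

# Count the points in each cluster whose coefficient is negative.
for cluster in np.unique(labels):
    n_negative = int((per_point[labels == cluster] < 0).sum())
    print(cluster, n_negative)
```

Listing the rows behind these negative values shows which team-seasons sit on the boundary between clusters.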
Merge the cluster results back to the defensive performance data frame to assess results:
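The season and ranking columns are kept in the data frame, unscaled, as merge keys. Note that the fit_predict call above is passed the full data frame, so as written these two columns are also part of the clustering input; they would need to be dropped before fitting to exclude them.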
pressing_cluster_data['pressing_cluster'] = pressing_cluster_results
defensive_actions_data = pd.merge(defensive_actions_data, pressing_cluster_data.loc[:, ['season', 'ranking', 'pressing_cluster']], on = ['season', 'ranking'])
Here is the plan for assessing the clusters:
Checking what cluster Top 6, Mid Table and Bottom 5 teams have been placed into:
defensive_actions_data.groupby('table_position_category')['pressing_cluster'].value_counts(normalize = True)*100
| table_position_category | pressing_cluster | percentage of teams |
|---|---|---|
| Bottom 5 | 0 | 86.206897 |
| Bottom 5 | 2 | 13.793103 |
| Mid Table | 0 | 54.000000 |
| Mid Table | 1 | 28.000000 |
| Mid Table | 2 | 18.000000 |
| Top 6 | 1 | 100.000000 |
All Top 6 teams are in cluster 1. Bottom 5 teams are predominantly (about 86%) in cluster 0. Mid Table teams are spread across all three clusters.
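The same breakdown can be laid out as a grid with `pd.crosstab`, which some readers find easier to scan than the stacked output of `value_counts`. The sketch below uses a small hypothetical frame; the category assignments are invented.

```python
import pandas as pd

# Hypothetical assignments; not the report's results.
toy = pd.DataFrame({
    "table_position_category": ["Top 6", "Top 6", "Mid Table", "Mid Table",
                                "Bottom 5", "Bottom 5", "Bottom 5", "Mid Table"],
    "pressing_cluster": [1, 1, 0, 1, 0, 0, 2, 2],
})

# normalize="index" turns each row into shares that sum to 1; multiply by 100 for percentages.
shares = pd.crosstab(
    toy["table_position_category"], toy["pressing_cluster"], normalize="index"
) * 100

print(shares.round(1))
```

Each row of the result sums to 100, with one column per cluster.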
Check the breakdown of the clusters in terms of pressing categories previously allotted to teams:
defensive_actions_data.groupby('high_press_team')['pressing_cluster'].value_counts(normalize = True)*100
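| high_press_team | pressing_cluster | percentage of teams |
|---|---|---|
| Average High Press Team | 0 | 50.000000 |
| Average High Press Team | 1 | 40.000000 |
| Average High Press Team | 2 | 10.000000 |
| High Pressing Team | 1 | 80.000000 |
| High Pressing Team | 0 | 13.333333 |
| High Pressing Team | 2 | 6.666667 |
| Non High Pressing Team | 0 | 60.000000 |
| Non High Pressing Team | 1 | 25.454545 |
| Non High Pressing Team | 2 | 14.545455 |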
80% of the high pressing teams are in cluster 1. The majority (60%) of non high pressing teams are in cluster 0. The average pressing teams are split mainly between cluster 0 (50%) and cluster 1 (40%).
Defining a palette for the pressing clusters:
palette_press_clusters = "tab10"
sns.set_style('white')
Analyse percentage of final 3rd tackles versus the pressing clusters:
plt.figure(figsize = (10,10))
sns.violinplot(data = defensive_actions_data, x = 'pressing_cluster', y = 'pct_final_3rd_tackles', palette = palette_press_clusters)
plt.title('Percentage of Final third tackles: for Pressing Clusters', fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
#plt.xlim([0,100])
plt.ylabel('Percentage of Final third tackles', fontsize = 15, labelpad = 15)
plt.xlabel("Pressing Cluster", fontsize = 15, labelpad = 15)
plt.xticks(fontsize = 15, fontweight = 'bold')
plt.show()
The violin plots are largely consistent with the cluster assignments.
Now checking passes_completed_percent_against and passes_launched_percent_against:
First, merging the cluster results to passing_opponent_data data frame which has the required columns.
passing_opponent_data = pd.merge(passing_opponent_data, pressing_cluster_data.loc[:, ['season', 'ranking', 'pressing_cluster']], on = ['season', 'ranking'])
adv_goalkeeping_against_data = pd.merge(adv_goalkeeping_against_data, pressing_cluster_data.loc[:, ['season', 'ranking', 'pressing_cluster']], on = ['season', 'ranking'])
Now plot the cluster results against the mean pass completion percentage against. The idea is to see whether the clustering has grouped together teams that press aggressively all over the pitch, not just in the final third.
A low pass completion percentage against a team can identify a side that is aggressive all over the pitch in trying to win the ball back, and therefore forces the opponent into erroneous passes.
plt.figure(figsize = (10,10))
press_cluster_plot_2 = sns.barplot(data = passing_opponent_data, x = 'passes_completed_percent_against', y = 'pressing_cluster',ci = None,orient = 'h', estimator = np.mean, palette = palette_press_clusters)
plt.title('Mean Pass Completion Percentage Against: for Pressing Clusters', fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlim([0,100])
plt.ylabel('Pressing Cluster', fontsize = 15, labelpad = 15)
plt.xlabel("Pass Completion Percentage Against", fontsize = 15, labelpad = 15)
plt.yticks(fontsize = 15, fontweight = 'bold')
press_cluster_plot_2.bar_label(press_cluster_plot_2.containers[0])
plt.show()
Teams in cluster 1 have the lowest pass completion percentage against, lending more credence to this being the cluster for aggressive pressing teams.
Checking the cluster results vs the percentage of passes launched:
plt.figure(figsize = (10,8))
press_cluster_plot_3 = sns.barplot(data = adv_goalkeeping_against_data, x = 'passes_launched_percent_against', y = 'pressing_cluster',orient = 'h', ci = None,estimator = np.mean, palette = palette_press_clusters)
plt.title('Mean Passes Launched % by Goalkeepers: for Pressing Clusters', fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlim([0,80])
plt.ylabel('Pressing Cluster', fontsize = 15, labelpad = 15)
plt.xlabel("Passes Launched % by Goalkeepers", fontsize = 15, labelpad = 15)
press_cluster_plot_3.bar_label(press_cluster_plot_3.containers[0])
plt.yticks(fontsize = 15, fontweight = 'bold')
plt.show()
Opponent goalkeepers launch the highest percentage of their passes against teams in cluster 1.
Checking cluster results vs percentage of goal kicks launched:
plt.figure(figsize = (10,8))
press_cluster_plot_4 = sns.barplot(data = adv_goalkeeping_against_data, x = 'goal_kicks_launched_percent_against', y = 'pressing_cluster',orient = 'h',ci = None ,estimator = np.mean, palette = palette_press_clusters)
plt.title('Mean % of Goal Kicks Launched: for Pressing Clusters', fontsize = 16, loc = 'center',fontweight = 'bold',pad = 15)
plt.xlim([0,80])
plt.ylabel('Pressing Cluster', fontsize = 15, labelpad = 15)
plt.xlabel("% of Goal Kicks Launched", fontsize = 15, labelpad = 15)
plt.yticks(fontsize = 15, fontweight = 'bold')
press_cluster_plot_4.bar_label(press_cluster_plot_4.containers[0])
plt.show()
Teams in cluster 1 have the highest percentage of goal kicks launched against them.
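The per-cluster means shown in the bar charts can also be tabulated in one step with `groupby`. This is a minimal sketch on a hypothetical frame; the numbers are invented and do not reproduce the charts.

```python
import pandas as pd

# Hypothetical values; two team-seasons per cluster.
toy = pd.DataFrame({
    "pressing_cluster": [0, 0, 1, 1, 2, 2],
    "passes_launched_percent_against": [30.0, 34.0, 44.0, 48.0, 36.0, 40.0],
    "goal_kicks_launched_percent_against": [40.0, 44.0, 58.0, 62.0, 47.0, 49.0],
})

# Mean of every metric per cluster, rounded for display.
summary = toy.groupby("pressing_cluster").mean().round(1)

print(summary)
# Cluster with the highest mean for each metric.
print(summary.idxmax())
```

A single table like this makes it easy to confirm whether one cluster leads on every metric.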
Let us look at the teams in cluster 1 now, to identify effective pressing teams:
Cluster 1:
defensive_actions_data.query("pressing_cluster == 1 & season_name != '2022-2023' ")[['squad','season_name','pressing_cluster','tackles_in_attacking_3rd', 'pct_final_3rd_tackles']].sort_values(by = 'pct_final_3rd_tackles', ascending = False).head(15)
| | squad | season_name | pressing_cluster | tackles_in_attacking_3rd | pct_final_3rd_tackles |
|---|---|---|---|---|---|
| 57 | Liverpool | 2019-2020 | 1 | 112 | 20.363636 |
| 19 | Manchester City | 2021-2022 | 1 | 101 | 20.281124 |
| 58 | Manchester City | 2019-2020 | 1 | 100 | 19.455253 |
| 20 | Liverpool | 2021-2022 | 1 | 107 | 19.314079 |
| 40 | Liverpool | 2020-2021 | 1 | 100 | 19.011407 |
| 38 | Manchester City | 2020-2021 | 1 | 88 | 17.670683 |
| 76 | Manchester City | 2018-2019 | 1 | 89 | 17.181467 |
| 95 | Manchester City | 2017-2018 | 1 | 91 | 16.163410 |
| 27 | Brighton | 2021-2022 | 1 | 105 | 15.718563 |
| 23 | Arsenal | 2021-2022 | 1 | 84 | 15.555556 |
| 78 | Chelsea | 2018-2019 | 1 | 93 | 15.048544 |
| 42 | Leicester City | 2020-2021 | 1 | 99 | 14.537445 |
| 45 | Arsenal | 2020-2021 | 1 | 66 | 14.473684 |
| 97 | Tottenham | 2017-2018 | 1 | 87 | 14.077670 |
| 59 | Manchester Utd | 2019-2020 | 1 | 81 | 13.989637 |
Cluster 1 can be identified as the effective high pressing teams. These teams have a high percentage of their tackles in the final third.
Looking at the percentage of passes completed against each of these teams:
passing_opponent_data.query("season_name != '2022-2023'")[['squad', 'season_name', 'pressing_cluster' ,'passes_completed_percent_against']].sort_values(by = 'passes_completed_percent_against', ascending = True).head(10)
| | squad | season_name | pressing_cluster | passes_completed_percent_against |
|---|---|---|---|---|
| 105 | Manchester City | 2017-2018 | 1 | 68.9 |
| 111 | Tottenham | 2017-2018 | 1 | 70.2 |
| 66 | Liverpool | 2019-2020 | 1 | 71.7 |
| 104 | Liverpool | 2017-2018 | 1 | 72.4 |
| 72 | Southampton | 2019-2020 | 0 | 72.6 |
| 29 | Liverpool | 2021-2022 | 1 | 72.8 |
| 87 | Liverpool | 2018-2019 | 1 | 73.2 |
| 88 | Manchester City | 2018-2019 | 1 | 73.4 |
| 46 | Leeds United | 2020-2021 | 1 | 73.9 |
| 95 | Arsenal | 2017-2018 | 1 | 74.1 |
Teams in cluster 1 account for 9 of the 10 lowest pass completion percentages against, which indicates that they force opponents into errant passes through aggressive pressure. These are the effective pressing teams.
A comprehensive analysis of the evolution of English football over the past five years has been carried out using data. Goalkeeping, defending, passing, possession and pressing were analysed. The approach combined domain knowledge of football to frame hypotheses with data analysis to test them. Here are some of the interesting findings of the report:
i. Passing by goalkeepers has evolved. Goalkeepers increasingly look to utilise shorter passes, both from goal kicks and open play. Consequently, the number of passes and goal kicks they launch long has reduced. Long passes by goalkeepers seem to have become a weapon to add variation to attacks or a last resort when being pressed intensely. Shorter passes have become the go-to, and a strong correlation was found between goalkeepers using short passes and high possession figures.
ii. The percentage of tackles made by teams in the final third increased, pointing to high pressing becoming an important tactic.
iii. Interesting trends were found in the role of both midfielders and defenders. It was found that defenders have made more passes as seasons progressed, but the number of progressive passes they made has decreased.
iv. About midfielders, a reduction was found in the number of passes, progressive passes and tackles plus interceptions they made, as seasons progressed.
v. Against effective high pressing teams, opponents have become braver in their attempts to play out from the back with their goalkeepers through shorter passes. In open play as well, opponents are using more short passes against high pressing teams. With pressing gaining increased focus and effectiveness, opponents have been forced to adapt.
vi. Through a clustering process, effective high pressing teams were identified and placed in a group. Though the performance metrics of the clustering showed scope for improvement, analysis of the clusters showed a homogeneous grouping of teams, especially the effective and aggressive pressing teams.
This report contains a unique blend of football and data analytics. The tactics and evolving trends of English football over the past five years have been analysed comprehensively. Possession and pressing were the key themes of the report, mirroring the fact that they have become key themes in English football over the past few years. The report was successful in uncovering some fascinating insights about the changes happening in English football. The report also threw up interesting questions with scope for future analysis, including:
[1] M. Carey, "The Athletic," December 2022. [Online]. Available: https://theathletic.com/4013021/2022/12/19/world-cup-2022-tactical-trends/.
[2] "Givemesport," [Online]. Available: https://www.givemesport.com/1830305-thierry-henry-when-pep-guardiola-hauled-him-off-for-ignoring-barcelona-tactics.
[3] "Statology," August 2021. [Online]. Available: https://www.statology.org/correlation-does-not-imply-causation-examples/.
[4] Unknown, "Soccerblade," July 2022. [Online]. Available: https://soccerblade.com/how-soccer-has-changed/.
[5] D. Taylor, "The Athletic," May 2022. [Online]. Available: https://theathletic.com/3327242/2022/05/24/guardiola-premier-league-change/.
[6] B. Connelly, "ESPN," April 2020. [Online]. Available: https://www.espn.com/soccer/english-premier-league/story/4086497/how-soccer-has-changed-in-the-past-10-years-from-mourinhos-peak-to-reign-of-super-clubs.
[7] A. Clarke, "Premier League," June 2022. [Online]. Available: https://www.premierleague.com/news/2641650.
[8] "Fbref," [Online]. Available: https://fbref.com/en/.
[9] "rookieroad," [Online]. Available: https://www.rookieroad.com/soccer/defensive-third/.
[10] "spielverlagerung," [Online]. Available: https://spielverlagerung.com/glossary/tactical-methods/pressing/.
[11] M. Cox, "The Athletic," August 2019. [Online]. Available: https://theathletic.com/1110806/2019/08/07/why-the-goal-kick-law-change-should-help-teams-play-out-from-the-back/.
[12] L. Tharme, "The Athletic," 2022. [Online]. Available: https://theathletic.com/3459692/2022/08/11/european-football-through-ball-dying-art/.
[13] J. Starmer, "youtube," [Online]. Available: https://www.youtube.com/watch?v=4b5d3muPQmA.
[14] "scikit-learn," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.